WO2008007424A1 - Genome analysis system, genome analysis method, and program - Google Patents

Genome analysis system, genome analysis method, and program

Info

Publication number
WO2008007424A1
WO2008007424A1 PCT/JP2006/313757 JP2006313757W WO2008007424A1 WO 2008007424 A1 WO2008007424 A1 WO 2008007424A1 JP 2006313757 W JP2006313757 W JP 2006313757W WO 2008007424 A1 WO2008007424 A1 WO 2008007424A1
Authority
WO
WIPO (PCT)
Prior art keywords
state variable
population
equation
update
genome analysis
Prior art date
Application number
PCT/JP2006/313757
Other languages
French (fr)
Japanese (ja)
Inventor
Junji Tanaka
Masato Inoue
Original Assignee
Digital Information Technologies Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Information Technologies Corporation
Priority to PCT/JP2006/313757
Publication of WO2008007424A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Definitions

  • Genome analysis system, genome analysis method, and program
  • the present invention relates to a genome analysis system, an analysis method, and a program that perform analysis for estimating, from sample data, the characteristics of a population and/or the position of each sample within the population.
  • the genome refers to a set of chromosomes that are indispensable for carrying out life activities.
  • a genome is a compound word made up of a gene and a chromosome.
  • the basis of life is a cell, the cell is surrounded by a cell membrane, the nucleus is surrounded by a nuclear membrane, and the independence of each unit is maintained.
  • Human cells comprise groups of specialized cells whose functions and forms have differentiated, such as nerve cells, muscle cells, blood cells, immune system cells, epithelial cells (the cells on the surface of skin and tissues), and sensory cells, together with the undifferentiated cells called stem cells from which they derive. Cells have an important time-varying aspect: they divide to make new cells. Cell division is an important mechanism that enables the transmission and expression of genetic information.
  • there are chromosomes in the nucleus. These chromosomes are what carry genetic information, and the genes are contained in them. Within the genome, genes mainly define how proteins are made.
  • the basic substance that makes up a chromosome is DNA (deoxyribonucleic acid), and genetic information is stored in a sequence of four bases in DNA, A, T, G, and C.
  • Haploid organisms such as bacteria and viruses have a single genome.
  • a diploid organism has two sets of genomes with overlapping genetic information. For example, germ cells such as human eggs and sperm have one set of genomes consisting of 23 chromosomes, while somatic cells have two sets of genomes (46 chromosomes). The human genome consists of about 3 billion DNA base pairs (3000 megabase pairs, where 1 megabase pair is 1 million base pairs) and, stretched out as a single strand, is about 1 meter long.
  • a genome is the totality of the genetic information present in a cell, and includes genes and the information that controls gene expression.
  • proteins and genes are, so to speak, products and their blueprints; besides the blueprints, the genome contains regions that manage and control the manufacture of the products.
  • there are also regions, making up a considerable proportion of the genome, whose significance is currently unknown but which are thought to have some influence on the maintenance of biological functions. By clarifying these, it is believed that a more accurate understanding of life phenomena will become possible.
  • genome analysis is a comprehensive analysis of the genetic information in an organism's genome, and it begins with determining the base sequence (the order of G, A, T, and C) of the DNA molecules that make up the genome.
  • the human genome is the nucleotide sequence of about 3 billion base pairs of DNA contained in a total of 24 chromosomes (that is, DNA molecules): 22 autosomes, the X chromosome, and the Y chromosome.
  • the genome information we carry is inherited from our parents, and our parents' genome information was in turn inherited from the ancestors of the generation before. Tracing genetic information back one generation at a time in this way leads to the genome of the first organisms, 3.8 billion years ago.
  • one genome analysis method takes genome sequence information as input and, where the input sequence contains a portion in which a plurality (for example, 10 or more) of identical bases are arranged consecutively, extracts base sequence information consisting of that portion together with a predetermined number of bases immediately before and after it, and outputs the extracted base sequence information.
  • this allows a polymorphic marker for identifying a disease-related candidate gene to be found quickly and efficiently, with accuracy close to that of SNPs (single nucleotide polymorphisms), without using SNPs.
  • Patent Document 1 describes a genome analysis method that attempts to find polymorphic markers for identifying disease-related candidate genes. However, genomes need to be analyzed from various viewpoints beyond the DNA base sequence alone, and much has not yet been elucidated; a variety of genome analysis methods are therefore expected to emerge.
  • Patent Document 1: Japanese Patent Laid-Open No. 2003-288346
  • the present invention has been made in view of such a situation and solves the above problems. It is intended to provide a genome analysis system and analysis method that can estimate, from sample data, the characteristics of a population and/or the position of each sample in the population.
  • the present invention has the following configurations to solve the above problems.
  • the gist of the invention of claim 1 is a genome analysis system characterized by comprising: capture means for taking in sample data;
  • selection means for selecting two state variables, a first state variable and a second state variable, each being a state variable that characterizes the population to which the sample data belongs or a state variable that represents the position of each sample of the sample data in the population; convergence means for converging the first state variable and the second state variable to desired values; and characteristic estimation means for estimating the characteristics of the population and/or the position of each sample in the population.
  • the gist of the invention described in claim 2 is a genome analysis system provided with capture means for taking in sample data and calculation means,
  • in which the calculation means selects a first state variable and a second state variable, each being a state variable that characterizes the population to which the sample data taken in by the capture means belongs or a state variable that represents the position of each sample of the sample data in the population,
  • and which is characterized by estimating the characteristics of the population and/or the position of each sample in the population.
  • the gist of the invention described in claim 3 is the genome analysis system of claim 1 or 2, further comprising: conversion means for converting the first state variable and the second state variable into each other, using as an operator an update expression in which genetic (statistical) knowledge is embedded and in which each variable is expressed in terms of the other; and estimation means for estimating the first state variable and the second state variable by means of a third state variable embedded in the update expressions adapted to the first state variable and the second state variable respectively.
  • the gist of the invention described in claim 4 is that the first state variable is an origin population belonging degree of each sample of the sample data, and the second state variable is an origin of the sample data.
  • the gist of the invention described in claim 5 is the genome analysis system according to any one of claims 1 to 4, in which the third state variable is the diplotype of each sample of the sample data and its frequency.
  • the gist of the invention described in claim 6 is the genome analysis system in which the first state variable update expression, that is, the update expression adapted to the first state variable, is represented by expression (1).
  • the gist of the invention described in claim 7 is the genome analysis system in which the second state variable update expression, that is, the update expression adapted to the second state variable, is represented by expression (2).
  • the gist of the invention described in claim 8 is the genome analysis system in which the second state variable update expression, that is, the update expression adapted to the second state variable, is represented by expression (3).
  • the gist of the invention described in claim 9 is the genome analysis system according to any one of claims 1 to 8, further comprising K optimum solution deriving means for obtaining the optimum solution for the number K of origin populations according to equation (4).
  • the gist of the invention described in claim 10 is a genome analysis system further comprising K optimum solution deriving means for obtaining the optimum solution for the number K of origin populations according to equation (5).
  • the gist of the invention described in claim 11 is the genome analysis system in which the update equation for updating the first state variable and the second state variable is expressed by equation (6).
  • the gist of the invention described in claim 12 is a determining means for determining a genetic polymorphism to be investigated
  • a wet process means for determining or estimating an individual's haplotype from allele information determined by the wet process for the genetic polymorphism of the population to be investigated;
  • a feature parameter determining means for determining two feature parameters, each being a feature parameter that characterizes the population and/or a feature parameter that indicates the position of each sample in the population;
  • Update formula construction means for constructing an update formula between the two feature parameters from genetic information
  • a feature parameter deriving means for sequentially obtaining the two feature parameters by an update formula
  • Conversion convergence means for repeating conversion until the two feature parameters converge, and by obtaining the two feature parameters, characteristics of the population and
  • the gist of the invention described in claim 13 is an acquisition step of acquiring sample data, a state variable characterizing the population to which the sample data belongs, and/or a state variable indicating the position of each sample in the population.
  • the gist of the invention described in claim 14 is the genome analysis method of claim 13, comprising: a conversion step of converting the first state variable and the second state variable into each other, using as an operator an update expression in which genetic (statistical) knowledge is embedded and in which each variable is expressed in terms of the other; and an estimation step of estimating the first state variable and the second state variable by means of the third state variable embedded in the update expression.
  • the gist of the invention of claim 15 is that the first state variable is the degree of membership of each sample of the sample data in the origin populations, and the second state variable is the origin population haplotype frequency of the sample data.
  • the gist of the invention of claim 16 is that the third state variable is each sample data.
  • the gist of the invention described in claim 17 is the genome analysis method in which the first state variable update expression, that is, the update expression adapted to the first state variable, is expressed by expression (1).
  • the gist of the invention described in claim 18 is the genome analysis method in which the second state variable update expression, that is, the update expression adapted to the second state variable, is expressed by expression (2).
  • the gist of the invention described in claim 19 is the genome analysis method in which the second state variable update expression, that is, the update expression adapted to the second state variable, is expressed by expression (3).
  • the gist of the invention described in claim 20 is the genome analysis method further including a K optimum solution derivation step for obtaining the optimum solution for the number K of origin populations according to equation (4).
  • the gist of the invention described in claim 21 is the K optimum solution derivation step for obtaining the optimum solution for the number K of origin populations according to equation (5).
  • the gist of the invention described in claim 22 is the genome analysis method in which the update equation for updating the first state variable and the second state variable is expressed by equation (6).
  • the gist of the invention described in claim 23 is a determination step of determining a genetic polymorphism to be investigated, a wet process step of determining allele information by a wet process of a genetic polymorphism of a population to be investigated,
  • the gist of the invention described in claim 24 resides in a program capable of executing the genome analysis method according to any one of claims 13 to 23.
  • the genome analysis system of the present invention can determine state variables that represent the characteristics of the population and state variables that represent the position of each sample in the population, for example the origin population membership of each sample and the haplotype frequency of each origin population, at a much higher speed than conventional methods, using multilocus genotype data and haplotype data.
  • FIG. 1 is a diagram for explaining the outline of the genome analysis system using the genome analysis method of the present invention
  • Fig. 2 is a block diagram of the genome analysis system of the present invention
  • Fig. 3 is a diagram for explaining the outline of analysis by the genome analysis system of Fig. 1.
  • FIG. 4 is a flow chart showing the genome analysis method of the present invention.
  • the genome analysis system 1 uses the sample data to estimate the characteristics of each population and the position of each sample in the population, and outputs the analysis result.
  • Sample data are drawn from a population and consist of genomic information in the broad sense, typified by genetic polymorphisms.
  • as the genome analysis system 1, a notebook computer, desktop computer, or the like equipped with an analysis program for performing the genome analysis calculations described later can be used.
  • as shown in FIG. 2, the genome analysis system of the present invention comprises determination means, wet process means, capture means, calculation means, selection means, feature parameter determination means, convergence means, conversion means, update formula construction means, feature parameter deriving means, conversion convergence means, feature estimation means, and estimation means.
  • in outline, the analysis by the genome analysis system 1 models the population as an entity that can be characterized by two state variables, as shown for example in FIG. 3.
  • a state variable that characterizes a population is a statistic derived from the population or from each sample. Examples include the origin population membership of each sample, the haplotype frequency of each origin population, and the diplotype frequency of individuals.
  • State variables include state A, which is the first state, and state B, which is the second state.
  • state A is the origin population membership of each sample
  • state B is the origin population haplotype frequency. State A and state B are converted into each other using, as an operator, an update expression in which each is expressed in terms of the other. Details of this update expression will be described later.
  • the genome analysis system 1 has a function of estimating three variables that represent the characteristics of the population to which the sample data belongs or the position of each sample in the population: a first variable and a second variable, which are loosely related through a third variable.
  • these three variables are estimated from a fourth variable, which can be observed. For example, as shown in Fig. 3, the analysis rests on the observation that state A and state B can be regarded as two aspects of the same population. The feature parameters are none other than these three variables.
  • the first, second, third, and fourth variables are defined by the following expression (7).
  • assume I samples indexed 1, 2, ..., I; K origin populations indexed 1, 2, ..., K; and H haplotypes indexed 1, 2, ..., H.
  • the vector may be labeled as b Kh ⁇ -ib k [, P —A, w].
  • a diplotype is the pair of maternally derived and paternally derived haplotypes, {h1, h2}
  • the first and second variables can be thought of as two states that characterize the system of interest; they are not completely independent but are loosely related via the third variable. Considering this, the first and second variables in equation (7) above can be regarded as update operators that update each other.
  • an update operator adapted to each of the first and second variables can be derived, and genetic (statistical) knowledge, that is, genetic information, is assumed to be embedded in these update operators. If the first variable and the second variable are only weakly related to each other, then given appropriate initial values they converge to the characteristics of the population and the position of each sample in the population.
  • the genetic knowledge is expressed by the following equation (8). It rests on the assumption that, once the origin population of a sample and the haplotype frequencies of that origin population are known, the probability that the sample has a particular diplotype is simply that of drawing a haplotype twice, with replacement, from the origin population.
  • with equation (9), the overall probability model is expressed as the following equation (10).
  • D is a function that equals 1 if the expression attached to its lower right holds, and 0 otherwise.
  • because equation (10) includes random variables that cannot be observed, the optimal parameters are obtained within the framework of the EM algorithm. Specifically, the optimal parameters that characterize the observation data are estimated according to the following equation (11). That is, the optimal number of origin populations, the haplotype frequencies of each population, the probability of each diplotype for each sample, and the probability that each sample derives from each origin population are obtained.
  • the haplotype frequency of each origin population can be obtained using the following two sequential update equations, (12).
  • the estimation algorithm is as follows. First, find the value of a. This can be obtained by assuming that the number of origin populations is one, specifically by using the following sequential update equation (16).
  • Equation (14) becomes unnecessary because estimation has already been completed in equations (16) and (17).
  • variable update equation can be expressed as in equation (19)
  • optimum K can be expressed as in equation (20).
  • each haplotype is given an ID number h as shown in the following equation (21). There are H types of haplotypes.
  • an ID number is assigned to each observed genotype (for example, observation data for multiple SNPs) and is denoted xi, as shown in equation (25). There are assumed to be I people in all.
  • the set of possible diplotypes is denoted D (x), as shown in equation (26).
  • D (xi) is the set of diplotypes that are possible given the genotype of the i-th person
  • an indicator is introduced as shown in the following equation (27). It takes the value 1 if di is in D (xi), and 0 otherwise.
  • given the diplotype, the person's genotype is uniquely determined, as shown in the following equation (29).
  • the genotype probability for the i-th person can be expressed as the following equation (30).
  • Equation (31): ŷ = arg max_y ln P({xi} | y), with the maximization carried out through the arg max of the function Q of the EM algorithm.
  • the solution of equation (31) is brought to its true value by iterative calculation, that is, by repeated substitution. This can be mathematically expressed as the following equation (32).
  • in equation (32), the calculation starts from an initial y (t), and y is recomputed repeatedly until it converges.
  • estimation from a single population has been available conventionally; the present invention extends that method to provide a diplotype determination method for samples drawn from a plurality of populations.
  • This diplotype determination method is explained below in more detail and in simpler terms.
  • each parameter and variable are set.
  • the number of populations is shown.
  • the matrix bk gives the frequency, within the k-th population, of the haplotype with ID h. Therefore, summing bk over all haplotypes h for the k-th population gives 1.
  • ki is the ID number of the population to which the i-th person belongs. Equations for each of these parameters and variables are listed in Equation (35).
  • di, the diplotype of the i-th person, can be expressed in terms of ki and the haplotype frequencies of population ki.
  • the equation (37) representing this di is shown below (see the estimation model (28) from a single population).
  • extending equation (30) of the single-population estimation model, the probability of the i-th person's genotype can be expressed as the following equation (38).
  • the required memory is on the order of O (KH).
  • the genetic polymorphism to be investigated is determined (step S1: determination step).
  • allele information is determined by a wet process applied to the genetic polymorphisms of the population to be investigated; the wet process is a process of determining genomic information, such as the genetic polymorphisms of a sample, using a DNA sequencer or the like.
  • haplotypes of individuals are determined or estimated from allele information (step S3: haplotype estimation step).
  • in step S4, two loosely related feature parameters representing the population are determined.
  • the origin population membership of the sample and the haplotype frequency of each origin population are used as two feature parameters.
  • an update operator between the two feature parameters is constructed from the genetic information and the third parameter (step S5: update formula construction step).
  • the third parameter here is the individual diplotype and its frequency.
  • Embedding genetic (statistical) knowledge, that is, genetic information, in the update operator means nothing other than adopting the individual's diplotype and its frequency, which constitute genetic (statistical) knowledge, as the third parameter.
  • in step S6, the two feature parameters are obtained in turn by the update operator
  • in step S7 (conversion convergence step), the conversion is repeated until the two feature parameters converge
  • The two feature parameters are then obtained (step S8).
  • Updating the feature parameters with the update formula means nothing other than obtaining the two feature parameters in turn with this update operator, alternately deriving each one from the other.
  • Converging the parameters by this update means converging the state variables to the values they should have, that is, approximating the true values.
  • Figures 5 to 9 show examples of analysis results obtained by genome analysis with an update operator that uses multilocus genotype data and haplotype data to infer the origin populations and assign each sample to an origin population.
  • the fast grouping algorithm is the analysis method of the present invention.
  • the haplotype is considered to be more powerful gene information than the allele, and the haplotype is used instead of the allele as the gene information used in the analysis.
  • the haplotype frequency bk of the origin population and the degree of membership cik of the sample to the origin population are adopted as two state variables characterizing the population.
  • the characteristics of the population to which the sampled individuals belong can be estimated.
  • the individual diplotype and its frequency, ai and di, were adopted as the third variable linking the two state variables, and the observed data, that is, the genotype information xi, as the fourth variable.
  • ai and di are obtained from the observation data X. Specifically, an appropriate initial value is assigned to y, equations (45) and (46) are calculated in order, and the calculation is repeated about 100 times until the value converges.
  • equation (54) can be used instead of equation (53).
  • FIG. 5 shows the difference in execution time between the present embodiment of the structure analysis program and the MCMC method. As FIG. 5 shows, the method of the present invention can output the result at a much higher speed than the conventional method.
  • Fig. 6 shows the haplotype frequency results of the two origin populations estimated by this example.
  • Fig. 7 shows the degree of membership cik of each sample in the origin populations, as estimated in this example.
  • Fig. 8 compares the estimation accuracy of this example, the MCMC method, and the cluster method on various data. The method of the present invention estimates with higher accuracy than the conventional methods.
  • FIG. 9 is an example of the result of the estimated number of origin populations in this example.
  • analysis for estimating the characteristics of a population from sample data can be performed at higher speed and for a larger number of samples.
  • FIG. 1 is an explanatory diagram of the outline of a genome analysis system used in the genome analysis method of the present invention.
  • FIG. 2 is a block diagram of the genome analysis system of the present invention.
  • FIG. 3 is a diagram for explaining the outline of analysis by the genome analysis system of FIG. 1.
  • FIG. 4 is a flowchart showing the genome analysis method of the present invention.
  • FIG. 5 is a comparison of the execution time of the genome analysis method of the present invention and the MCMC method.
  • FIG. 8 is a comparison of origin population estimation results between the present invention, the MCMC method, and the cluster method.
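
The set D(x) of diplotypes compatible with an observed genotype, and the assumption that a diplotype arises by drawing a haplotype twice with replacement, can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the patent's equations (26) to (30), which are not reproduced in this text. The genotype coding (0, 1, or 2 copies of one allele per locus), the representation of a haplotype as a tuple of 0/1 alleles, and the function names `compatible_diplotypes` and `genotype_probability` are all hypothetical choices made for the example.

```python
from itertools import product

def compatible_diplotypes(genotype):
    """Return the unordered haplotype pairs whose per-locus allele sums equal the genotype."""
    # Assumed coding: 0 -> both alleles 0, 2 -> both alleles 1, 1 -> one of each (phase unknown)
    options = [[(0, 0)] if g == 0 else [(1, 1)] if g == 2 else [(0, 1), (1, 0)]
               for g in genotype]
    pairs = set()
    for choice in product(*options):
        h1 = tuple(a for a, _ in choice)
        h2 = tuple(b for _, b in choice)
        pairs.add(tuple(sorted((h1, h2))))   # ignore which haplotype is listed first
    return sorted(pairs)

def genotype_probability(genotype, freq):
    """Sum the probabilities of all compatible diplotypes under sampling with replacement."""
    total = 0.0
    for h1, h2 in compatible_diplotypes(genotype):
        p = freq.get(h1, 0.0) * freq.get(h2, 0.0)
        total += p if h1 == h2 else 2.0 * p  # a heterozygous pair can be drawn in two orders
    return total

# Two heterozygous loci leave the phase ambiguous, so two diplotypes are compatible.
diplotypes = compatible_diplotypes((1, 1))
uniform = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
prob = genotype_probability((1, 1), uniform)
```

The enumeration grows as 2 to the power of the number of heterozygous loci, which is one reason a practical implementation would restrict the haplotypes considered; that restriction is left out of this sketch.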

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Complex Calculations (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An analysis system and method for estimating the features of a population from sampled data. Sample data is captured. Knowledge of genetics (statistics) is embedded in an update expression for two state variables, a first and a second, that characterize a population. The update expression is applied repeatedly until the first and second state variables converge to the values they should take. In this way the feature parameter of the population to which the sample data belongs and/or the feature parameter representing the position of each sample in the population is estimated, and the estimated features of the population and/or the position of each sample in the population can be output.
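
As a rough illustration of the alternating update that the abstract describes, the following Python sketch alternates between a membership estimate and a haplotype frequency estimate. It is a generic EM-style mixture update written under stated assumptions, not the update expressions (1) to (6) of the claims, which are not reproduced in this text. The function name `alternate_updates`, the mixing weights `pi`, the random initial values, the fixed iteration count, and the simplification that each sample's diplotype is already known are all assumptions made for the example; only the symbols `b` (origin population haplotype frequency) and `c` (membership of each sample) follow the description.

```python
import numpy as np

def alternate_updates(diplotypes, K, H, n_iter=100, seed=0):
    """Alternately update membership c[i, k] and haplotype frequency b[k, h]."""
    rng = np.random.default_rng(seed)
    d = np.asarray(diplotypes)                 # shape (I, 2): two haplotype IDs per sample
    I = d.shape[0]
    # counts[i, h] = number of copies of haplotype h carried by sample i (0, 1, or 2)
    counts = np.zeros((I, H))
    np.add.at(counts, (np.arange(I), d[:, 0]), 1.0)
    np.add.at(counts, (np.arange(I), d[:, 1]), 1.0)
    b = rng.dirichlet(np.ones(H), size=K)      # random initial frequencies, rows sum to 1
    pi = np.full(K, 1.0 / K)                   # mixing weights (an assumption of this sketch)
    for _ in range(n_iter):
        # Update c from b: probability of drawing both haplotypes from population k
        lik = pi * b[:, d[:, 0]].T * b[:, d[:, 1]].T      # shape (I, K)
        c = lik / lik.sum(axis=1, keepdims=True)
        # Update b from c: membership-weighted haplotype counts, normalised per population
        weighted = c.T @ counts + 1e-9                     # small constant avoids division by zero
        b = weighted / weighted.sum(axis=1, keepdims=True)
        pi = c.mean(axis=0)
    return c, b

samples = [(0, 0), (0, 1), (1, 1), (2, 2), (2, 3), (3, 3)]  # toy diplotypes, made up for the example
c, b = alternate_updates(samples, K=2, H=4)
```

Each row of `c` and each row of `b` sums to 1 by construction. Which population ends up labelled 0 or 1 depends on the random initial values, and the result is only a local optimum of the assumed model.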

Description

明 細 書  Specification
ゲノム解析システム、ゲノム解析方法及びプログラム  Genome analysis system, genome analysis method and program
技術分野  Technical field
[oooi] 本発明は、特にサンプルデータにより母集団の特徴及び Z又は各標本の母集団の 中での位置付けを推定するための解析を行うゲノム解析システム、解析方法及びプ ログラムに関する。  [oooi] The present invention relates to a genome analysis system, an analysis method, and a program for performing analysis for estimating the characteristics of a population and the position of Z or each specimen in the population, particularly from sample data.
背景技術  Background art
[0002] Almost all living organisms on Earth are composed of cells, and each cell contains a genome that records genetic information. Cells are classified by structure into prokaryotic and eukaryotic cells. In prokaryotic cells, such as bacteria and cyanobacteria, the genome exists in the cell without any partition, whereas in eukaryotic cells, such as those of animals and plants, the genome resides in a nucleus enclosed by a nuclear membrane.
[0003] In other words, the genome is the set of chromosomes indispensable for sustaining life. The word "genome" is a compound of "gene" and "chromosome".
[0004] The basic unit of life is the cell. Each cell is enclosed by a cell membrane and its nucleus by a nuclear membrane, which keeps each unit independent. Human cells comprise groups of cells that have differentiated and specialized in function and form, such as nerve cells, muscle cells, blood cells, immune system cells, epithelial cells (the cells on the surface of skin and tissues), and sensory cells, together with the undifferentiated cells, called stem cells, from which they arise. Cells also have an important time-varying aspect: they divide to produce new cells. Cell division is a key mechanism that enables the transmission and expression of genetic information.
[0005] Chromosomes reside in the nucleus. It is the chromosomes that carry genetic information, and genes are contained within them. Within the genome, genes mainly define how proteins are made. The basic substance of a chromosome is DNA (deoxyribonucleic acid), and genetic information is stored as the sequence of the four bases A, T, G, and C in DNA. Haploid organisms such as bacteria and viruses have a single genome.
[0006] Diploid organisms have two sets of genomes carrying duplicate genetic information. For example, human germ cells such as eggs and sperm have one genome set consisting of 23 chromosomes, whereas somatic cells have two sets (46 chromosomes). The human genome consists of about 3 billion DNA base pairs (3,000 megabase pairs, where 1 megabase pair is 1 million base pairs), which would be about 1 meter long if stretched out as a single strand.
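The "about 1 meter" figure can be checked with a short calculation. The sketch below is illustrative only and assumes a helical rise of roughly 0.34 nm per base pair for B-form DNA, a value that is not stated in the text; the function name is hypothetical.

```python
# Rough length check for the figure quoted above.
# ASSUMPTION: about 0.34 nm of helix length per base pair (B-form DNA);
# this constant is not given in the document.
BASE_PAIRS = 3_000_000_000   # about 3 billion base pairs, as stated in the text
RISE_PER_BP_NM = 0.34        # assumed rise per base pair, in nanometres


def genome_length_m(base_pairs: int, rise_nm: float = RISE_PER_BP_NM) -> float:
    """Return the stretched-out length in metres of a DNA molecule."""
    return base_pairs * rise_nm * 1e-9


print(f"{genome_length_m(BASE_PAIRS):.2f} m")
```

Under that assumption, 3 billion base pairs come to roughly one meter, which agrees with the order of magnitude given in the text.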
[0007] The genome is the totality of the genetic information present in a cell, and it includes genes as well as information that controls gene expression. Proteins and genes are, so to speak, products and their blueprints, and the genome contains, in addition to the blueprints, regions that manage and control the manufacture of the products. There is also a considerable proportion of regions whose significance is currently unknown but which are thought to have some influence on the maintenance of biological functions. Clarifying these is expected to enable a more accurate understanding of life phenomena.
[0008] For these reasons, in addition to the Human Genome Project, which analyzes the entire base sequence of the human genome, projects to determine genome base sequences are being pursued for a variety of organisms. Studying the genome together with genes and proteins as a trinity is expected to yield an advanced understanding of life phenomena.
[0009] To that end, the network among genes must first be understood, because multiple proteins form networks and it is these groups of proteins that perform specific functions. Studying what functions are performed and what information is exchanged may therefore lead to the discovery of genes with unknown functions.
[0010] Genome analysis is the comprehensive analysis of the genetic information carried by an organism's genome, and it begins with determining the base sequence (the order of G, A, T, and C) of the DNA molecules that make up the genome. Base sequence data alone, however, do not readily reveal where genes are located or what they are. Analysis therefore proceeds on the basis of analyses of gene products such as messenger RNA and proteins produced by transcription and translation, comparisons of how similar base sequences are between species, and data on individual genes analyzed in experimental organisms such as E. coli and budding yeast.
[0011] Incidentally, in humans, the human genome is the base sequence of about 3 billion base pairs of DNA contained in a total of 24 chromosomes (that is, DNA molecules): 22 autosomes plus the X and Y chromosomes. The genome information each of us carries is inherited from our parents, one generation back, whose genome information was in turn inherited from their ancestors a generation earlier. Tracing the origin of genetic information back generation by generation in this way leads to the genome of the first organisms, 3.8 billion years ago.
[0012] As a genome analysis technique, Patent Document 1 proposes a method in which genome sequence information is input; it is determined whether the input information contains a portion in which the same base occurs consecutively a plurality of times (for example, 10) or more; if so, base sequence information consisting of a predetermined number of bases immediately upstream and downstream of that portion is extracted; and the extracted base sequence information is output.
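The procedure summarized from Patent Document 1 can be illustrated with a minimal sketch. This is not the implementation of Patent Document 1; the function name, the default flank length of 5 bases, and the output tuple layout are assumptions chosen for illustration.

```python
import re


def flanks_of_homopolymer_runs(seq, min_run=10, flank=5):
    """Find runs of one base repeated at least `min_run` times and return,
    for each run, (start, base, run_length, upstream, downstream), where
    upstream/downstream are the `flank` bases on either side of the run.

    ASSUMPTION: `flank` stands in for the "predetermined number of bases";
    its default value here is arbitrary.
    """
    # One captured base followed by at least (min_run - 1) copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    results = []
    for match in pattern.finditer(seq):
        start, end = match.span()
        upstream = seq[max(0, start - flank):start]
        downstream = seq[end:end + flank]
        results.append((start, match.group(1), end - start, upstream, downstream))
    return results


example = "ACGTG" + "A" * 12 + "CCGTA"
for hit in flanks_of_homopolymer_runs(example):
    print(hit)
```

A regular expression with a back-reference keeps the run detection to a single pass; flanks are simply truncated when a run lies near either end of the sequence.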
[0013] Such a genome analysis method makes it possible to find polymorphic markers for identifying disease-related candidate genes quickly and efficiently, with accuracy close to that of SNPs (single nucleotide polymorphisms) but without using SNPs.
[0014] The method disclosed in Patent Document 1 is one genome analysis technique, designed to find polymorphic markers for identifying disease-related candidate genes. Genome analysis, however, sometimes requires analyzing the base sequence of about 3 billion DNA base pairs from many different viewpoints. Various genome analysis techniques that have not yet been worked out are therefore expected to exist, and their development is awaited.
Patent Document 1: Japanese Patent Laid-Open No. 2003-288346
Disclosure of the Invention
Problems to Be Solved by the Invention
[0015] In such conventional genome analysis techniques, diplotype determination by maximum likelihood estimation has existed for samples drawn from a single population, but diplotype determination by maximum likelihood estimation has been difficult for samples drawn from multiple populations.
[0016] The present invention has been made in view of this situation and provides a genome analysis system and analysis method that solve the above problem by estimating, from sample data, the characteristics of a population and/or the position of each sample within the population.
Means for Solving the Problems
To solve the above problems, the present invention adopts the configurations set out below.
The gist of the invention according to claim 1 resides in a genome analysis system characterized by comprising: selection means by which, when sample data are taken in, two state variables, a first state variable and a second state variable, are selected, each being a state variable that characterizes the population to which the sample data belong or a state variable that represents the position of each sample of the sample data within the population; and feature estimation means by which the characteristics of the population and/or the position of each sample within the population are estimated by convergence means that causes the first state variable and the second state variable to converge to the values they should inherently take.
The gist of the invention according to claim 2 resides in a genome analysis system characterized by comprising capture means for taking in sample data and computation means, wherein the computation means selects two state variables, a first state variable and a second state variable, each being a state variable that characterizes the population to which the sample data taken in by the capture means belong or a state variable that represents the position of each sample of the sample data within the population, causes the first state variable and the second state variable to converge to the values they should inherently take, and estimates the characteristics of the population and/or the position of each sample within the population.
The gist of the invention according to claim 3 resides in the genome analysis system according to claim 1 or 2, characterized by further comprising: conversion means by which the first state variable and the second state variable are converted into each other using, as operators, update equations in which each is expressed in terms of the other and in which knowledge of (statistical) genetics is embedded; and estimation means by which the first state variable and the second state variable are estimated by means of a third state variable embedded in the update equation applicable to each.
The gist of the invention according to claim 4 resides in the genome analysis system according to any one of claims 1 to 3, characterized in that the first state variable is the origin population membership degree of each sample of the sample data, and the second state variable is the origin population haplotype frequency of the sample data.
The gist of the invention according to claim 5 resides in the genome analysis system according to any one of claims 1 to 4, characterized in that the third state variable is the diplotype of each sample of the sample data and its frequency.
The gist of the invention according to claim 6 resides in the genome analysis system according to any one of claims 1 to 5, characterized in that the first state variable update equation, which is the update equation applicable to the first state variable, is expressed by equation (1) below.
[Equation 1] (image imgf000007_0001 in the original publication)
The gist of the invention according to claim 7 resides in the genome analysis system according to any one of claims 1 to 6, characterized in that the second state variable update equation, which is the update equation applicable to the second state variable, is expressed by equation (2) below.
[Equation 2] (image imgf000007_0002 in the original publication)
The gist of the invention according to claim 8 resides in the genome analysis system according to any one of claims 1 to 7, characterized in that the second state variable update equation, which is the update equation applicable to the second state variable, is expressed by equation (3) below.
[Equation 3] (image imgf000007_0003 in the original publication)
The gist of the invention described in claim 9 is the power of any one of claims 1 to 8, further comprising K optimum solution deriving means for obtaining an optimum solution by using the number K of the origin population as the following equation (4): Exists in the described genome analysis system.
[数 4] [Equation 4]
K = arg max / 、". j〉 c , \n b, , - I\n K ( 4 ) κ  K = arg max /, ".j> c, \ n b,,-I \ n K (4) κ
The gist of the invention according to claim 10 resides in the genome analysis system according to any one of claims 1 to 9, characterized by further comprising K optimal solution derivation means for obtaining the optimal solution for the number K of origin populations according to equation (5) below.
[Equation 5] (image imgf000008_0001 in the original publication)
The gist of the invention according to claim 11 resides in the genome analysis system according to any one of claims 1 to 10, characterized in that the update equation for updating the first state variable and the second state variable is expressed by equation (6) below.
[Equation 6] (image imgf000008_0002 in the original publication)
The gist of the invention according to claim 12 resides in the genome analysis system according to any one of claims 1 to 11, characterized by comprising: determination means by which the genetic polymorphisms to be investigated are determined; wet process means by which the haplotype of each individual is determined or estimated from allele information determined by a wet process for the genetic polymorphisms of the group to be investigated; feature parameter determination means by which two feature parameters are determined, each being a feature parameter that characterizes the group and/or a feature parameter that represents the position of the group within the population; update equation construction means by which update equations between the two feature parameters are constructed from genetic information; feature parameter derivation means by which, starting from predetermined initial values, the two feature parameters are obtained in turn by the update equations; and conversion convergence means that repeats the conversion until the two feature parameters converge, wherein, once the two feature parameters have been obtained, the characteristics of the population and/or the position of each sample within the population are estimated from the sample data.
The gist of the invention according to claim 13 resides in a genome analysis method characterized by comprising: a capture step of taking in sample data; and a feature estimation step of estimating the characteristics of the population and/or the position of each sample within the population by means of a convergence step in which two state variables, a first state variable and a second state variable, each being a state variable that characterizes the population to which the sample data belong and/or a state variable that represents the position of each sample within the population, are selected and caused to converge to the values they should inherently take.
The gist of the invention according to claim 14 resides in the genome analysis method according to claim 13, characterized by further comprising: a conversion step of converting the first state variable and the second state variable into each other using, as operators, update equations in which each is expressed in terms of the other and in which knowledge of (statistical) genetics is embedded; and an estimation step of estimating the first state variable and the second state variable by means of a third state variable embedded in the update equation applicable to each.
The gist of the invention according to claim 15 resides in the genome analysis method according to claim 13 or 14, characterized in that the first state variable is the origin population membership degree of each sample of the sample data, and the second state variable is the origin population haplotype frequency of the sample data.
The gist of the invention according to claim 16 resides in the genome analysis method according to claim 14 or 15, characterized in that the third state variable is the diplotype of each sample of the sample data and its frequency.
The gist of the invention according to claim 17 resides in the genome analysis method according to any one of claims 13 to 16, characterized in that the first state variable update equation, which is the update equation applicable to the first state variable, is expressed by equation (1) below.
[Equation 1] (image imgf000010_0001 in the original publication)
The gist of the invention according to claim 18 resides in the genome analysis method according to any one of claims 13 to 17, characterized in that the second state variable update equation, which is the update equation applicable to the second state variable, is expressed by equation (2) below.
[Equation 2] (image in the original publication)
The gist of the invention according to claim 19 resides in the genome analysis method according to any one of claims 13 to 18, characterized in that the second state variable update equation, which is the update equation applicable to the second state variable, is expressed by equation (3) below.
[Equation 3] (image in the original publication)
The gist of the invention according to claim 20 resides in the genome analysis method according to any one of claims 13 to 19, characterized by further comprising a K optimal solution derivation step of obtaining the optimal solution for the number K of origin populations according to equation (4) below.
[Equation 4] (image in the original publication; the extracted text shows only an arg max over K with a penalty term of the form I ln K)
The gist of the invention according to claim 21 resides in the genome analysis method according to any one of claims 13 to 20, characterized by further comprising a K optimal solution derivation step of obtaining the optimal solution for the number K of origin populations according to equation (5) below.
[Equation 5] (image imgf000011_0001 in the original publication)
The gist of the invention according to claim 22 resides in the genome analysis method according to any one of claims 13 to 21, characterized in that the update equation for updating the first state variable and the second state variable is expressed by equation (6) below.
[Equation 6] (image imgf000011_0002 in the original publication)
The gist of the invention according to claim 23 resides in the genome analysis method according to any one of claims 13 to 22, characterized by comprising: a determination step of determining the genetic polymorphisms to be investigated; a wet process step of determining, by a wet process, allele information for the genetic polymorphisms of the group to be investigated; a haplotype estimation step of determining or estimating the haplotype of each individual from the allele information; a feature parameter determination step of determining two feature parameters that characterize the group; an update equation construction step of constructing update equations between the two feature parameters from genetic information; a feature parameter derivation step of obtaining the two feature parameters in turn by the update equations, starting from predetermined initial values; and a conversion convergence step of repeating the conversion until the two feature parameters converge, wherein, once the two feature parameters have been obtained, the characteristics of the population and/or the position of each sample within the population are estimated from the sample data.
The gist of the invention according to claim 24 resides in a program capable of executing the genome analysis method according to any one of claims 13 to 23.
Effects of the Invention
[0018] The genome analysis system of the present invention can obtain state variables representing the characteristics of a population and state variables representing the position of each sample within the population, for example the origin population membership degree of each sample and the haplotype frequencies of each origin population, far faster than conventional methods, using multi-locus genotype data and haplotype data.
[0019] It can also estimate the origin populations and assign each sample to an origin population with higher accuracy than conventional methods. Moreover, whereas conventional methods were limited to about 20 samples determined at a time, the method of the present invention can obtain results for a larger number of samples at once.
[0020] Furthermore, it makes possible the estimation of origin populations and the assignment of each sample to an origin population even for samples drawn from multiple populations, which was difficult with conventional genome analysis techniques.
Best Mode for Carrying Out the Invention
[0021] Embodiments of the present invention are described below.
FIG. 1 is a diagram outlining a genome analysis system that uses the genome analysis method of the present invention; FIG. 2 is a configuration diagram of the genome analysis system of the present invention; FIG. 3 is a diagram outlining the analysis performed by the genome analysis system of FIG. 1; and FIG. 4 is a flowchart showing the genome analysis method of the present invention.
[0022] As shown in FIG. 1, the genome analysis system 1 estimates, from sample data, the characteristics of a population and the position of each sample within the population, and outputs the analysis results. Sample data are genome information in the broad sense, typified by genetic polymorphisms, sampled from a population. The genome analysis system 1 may be a notebook computer, desktop computer, or the like on which an analysis program that performs the genome analysis computations described later is installed. As shown in FIG. 2, the genome analysis system of the present invention comprises determination means, wet process means, capture means, computation means, selection means, feature parameter determination means, convergence means, conversion means, update equation construction means, feature parameter derivation means, conversion convergence means, feature estimation means, estimation means, and the like.
[0023] As shown in FIG. 3, for example, the analysis performed by the genome analysis system 1 models an entity that can be characterized by two state variables characterizing a group. A state variable that characterizes a group is a statistic derived from the population or from each sample; examples include the origin population membership degree of each sample, the origin population haplotype frequencies, and the diplotype frequencies of individuals. The state variables comprise state A, the first state, and state B, the second state. Knowledge of (statistical) genetics is embedded in the two update operators φ, one for each state, with which state A and state B are updated in turn; as they converge to the values (states) held by the entity (the population or each sample), the characteristics of the population and the position of each sample within the population are estimated.
[0024] Here, state A is the origin population membership degree of each sample, and state B is the origin population haplotype frequency. State A and state B are converted into each other using, as operators, update equations in which each is expressed in terms of the other; the details of these update equations are described later.
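The alternation between state A and state B can be pictured as a generic fixed-point loop. The sketch below is only a schematic under stated assumptions: the two update functions are arbitrary placeholders supplied by the caller, not the update equations of this document, and the function name, tolerance, and iteration cap are hypothetical.

```python
def alternate_until_converged(a0, update_b_from_a, update_a_from_b,
                              tol=1e-9, max_iter=1000):
    """Alternately update two states until state A stops changing.

    ASSUMPTION: states are flat lists of floats and convergence is judged
    by the largest absolute change in state A between iterations.
    Returns (state_a, state_b, iterations_used).
    """
    a = list(a0)
    b = update_b_from_a(a)
    for iteration in range(1, max_iter + 1):
        new_a = update_a_from_b(b)        # state B -> state A
        new_b = update_b_from_a(new_a)    # state A -> state B
        change = max(abs(x - y) for x, y in zip(new_a, a))
        a, b = new_a, new_b
        if change < tol:
            return a, b, iteration
    raise RuntimeError("no convergence within max_iter iterations")


# Toy placeholder operators, chosen only so that the loop has a fixed point.
a, b, n = alternate_until_converged(
    [0.0],
    update_b_from_a=lambda a: [0.5 * x + 1.0 for x in a],
    update_a_from_b=lambda b: [0.5 * y for y in b],
)
print(a, b, n)
```

With these toy operators the loop settles at a = 2/3 and b = 4/3; in the system described here the operators would instead encode the genetic knowledge referred to in the text.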
[0025] The genome analysis system 1 also has a function of estimating three variables that represent the characteristics of the population to which the sample data belong or the position of each sample within the population. That is, the first and second variables are loosely related through the third variable, and the system estimates these three variables from a fourth variable, which can be observed. This rests on the observation that, as in FIG. 3 for example, state A and state B can be regarded as two aspects of a single group. The feature parameters are none other than these three variables.
[0026] The first, second, third, and fourth variables are defined as in equation (7) below, assuming I samples 1, 2, ..., I; K origin populations 1, 2, ..., K; and H haplotypes 1, 2, ..., H.
[0027] [Equation 7]
(I) The probability that sample i originates from origin population k; for each sample these probabilities sum to 1 over k.
(II) The frequency $b_{kh}$ of haplotype h in origin population k, with $\sum_h b_{kh} = 1$. Vector notation such as $b_k = [b_{k1}, \ldots, b_{kH}]$ is also used.
(III) The probability that sample i has diplotype $d_i$; for each sample these probabilities sum to 1. A diplotype $d_i$ denotes the pair $\{d_{i1}, d_{i2}\}$ of a maternally derived haplotype and a paternally derived haplotype.
(IV) The genotype $x_i$ observed for sample i.
... (7)
[0028] The first and second variables can be regarded as two states that characterize the system under study and that are not completely independent but loosely related through the third variable. Seen this way, update operators that update the first and second variables of equation (7) from each other can be considered.
[0029] From the observed fourth variable, update operators applicable to the first and second variables can be derived, and genetic information, that is, knowledge of (statistical) genetics, is embedded in these operators. If the first and second variables are weakly related to each other, then giving suitable initial values and applying the operators repeatedly leads to convergence on the characteristics the population inherently has, or on the position of each sample within the population.
[0030] As a concrete example, consider the case where the sampled group is composed of several origin populations and the origin populations are to be estimated from the sample data alone.
[0031] The genetic knowledge is formulated as equation (8) below. It is simple: assuming that the origin population of a particular sample is known and that the haplotype frequencies of that origin population are also known, the probability that the sample has a particular diplotype equals the probability of drawing haplotypes at random from that origin population twice with replacement.
[0032] [Equation 8] (image imgf000014_0001 in the original publication) ... (8)
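The verbal statement of equation (8), two draws with replacement from the haplotype frequencies of one origin population, can be sketched as follows. This is an illustration under assumptions, not a transcription of equation (8): haplotypes are indexed from 0, the diplotype is treated as an ordered (maternal, paternal) pair so that no factor of 2 appears for heterozygotes, and the function name is hypothetical.

```python
def diplotype_probability(b_k, diplotype):
    """Probability of an ordered (maternal, paternal) haplotype pair when
    both haplotypes are drawn independently, with replacement, from the
    haplotype frequencies b_k of a single origin population.

    ASSUMPTION: b_k is a list of frequencies summing to 1, indexed by
    haplotype number starting at 0.
    """
    maternal, paternal = diplotype
    return b_k[maternal] * b_k[paternal]


b_k = [0.5, 0.3, 0.2]                       # illustrative frequencies only
print(diplotype_probability(b_k, (0, 1)))   # product of the two frequencies
```

Summed over all ordered pairs, these probabilities add up to 1, as expected for two independent draws.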
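[0033] Here, as a prior distribution, the simple assumption is introduced, as in equation (9) below, that every sample is equally likely to originate from any of the origin populations. The overall probability model is then expressed by equation (10) below.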
Figure imgf000014_0001
[0033] Here, it is assumed that for any sample, the source population is derived from a completely equal probability! /, And a simple assumption is introduced as a prior distribution as shown in Equation (9) below. To do. Then, the overall probability model is expressed as the following equation (10).
[0034] [Equation 9]
$$P(k_i \mid K) = \frac{1}{K} \qquad (9)$$
[Equation 10]
$$P(\{x_i\},\{d_i\},\{k_i\} \mid \{b_k\}, K) = \prod_{i=1}^{I} \delta_{d_i \in D(x_i)}\, \frac{1}{K} \prod_{j=1}^{2} b_{k_i,\,d_{i,j}} \qquad (10)$$
Here $D(x_i)$ denotes the set of all diplotypes that are possible for the observed genotype $x_i$, and $\delta$ is a function that takes the value 1 if the expression in its subscript is true and 0 otherwise.
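The model of equations (8) to (10) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: diplotypes are represented as ordered pairs of haplotype indices, `possible` is a hypothetical dictionary mapping a genotype ID to its set D(x), and the function names are not from the source.

```python
from math import prod


def diplotype_prob(d, b_k):
    """Eq. (8): two independent draws, with replacement, from frequencies b_k."""
    return prod(b_k[h] for h in d)


def joint_prob(xs, ds, ks, b, possible):
    """Eq. (10): joint probability of genotypes, diplotypes and population IDs.

    xs: genotype IDs, ds: diplotypes (h1, h2), ks: population indices,
    b: list of K haplotype-frequency lists, possible: genotype ID -> D(x).
    """
    K = len(b)
    p = 1.0
    for x, d, k in zip(xs, ds, ks):
        delta = 1.0 if d in possible[x] else 0.0   # indicator for d in D(x)
        p *= delta * (1.0 / K) * diplotype_prob(d, b[k])  # uniform prior 1/K
    return p
```

For instance, with two populations and a single subject whose diplotype is consistent with the observed genotype, the joint probability is the prior 1/K times the product of the two haplotype frequencies; an inconsistent diplotype gives probability zero.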
[0036] Because equation (10) contains random variables that cannot be observed, the optimal parameters are sought within the framework of the EM algorithm. Specifically, the optimal parameters characterizing the observed data are estimated according to equation (11) below. That is, the following are obtained: the optimal number of origin populations, the haplotype frequencies of each population, the probability of each diplotype for each sample, and the probability that each sample derives from each origin population.
[0037] [Equation 11]
$$\{\{\hat b_k\}, \hat K\} = \arg\max_{\{b_k\},\,K} \left\langle \ln P(\{x_i\},\{d_i\},\{k_i\} \mid \{b_k\}, K) \right\rangle_{P(\{d_i\},\{k_i\} \mid \{x_i\},\{b_k\},K)} \qquad (11)$$
Here the symbol $\langle X \rangle_Y$ denotes the average of $X$ with respect to $Y$.
[0038] Assuming that the number of origin populations K is known, the haplotype frequencies of each origin population are obtained with the following pair of iterative update equations, (12).
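[0039] [Equation 12]
$$z^{(t)}_{i,d_i,k_i} = \frac{\delta_{d_i \in D(x_i)} \prod_{j=1}^{2} b^{(t)}_{k_i,\,d_{i,j}}}{\sum_{k'=1}^{K} \sum_{d' \in D(x_i)} \prod_{j=1}^{2} b^{(t)}_{k',\,d'_j}}, \qquad b^{(t+1)}_{k,h} = \frac{\sum_{i=1}^{I} \sum_{d_i \in D(x_i)} z^{(t)}_{i,d_i,k} \sum_{j=1}^{2} \delta_{d_{i,j}=h}}{\sum_{h'=1}^{H} \sum_{i=1}^{I} \sum_{d_i \in D(x_i)} z^{(t)}_{i,d_i,k} \sum_{j=1}^{2} \delta_{d_{i,j}=h'}} \qquad (12)$$
Set suitable initial values $b^{(0)}_{k,h}$ and repeat the updates of $z$ and $b$ for $t = 0, 1, \ldots$ until the values converge. Denote the converged values by $\hat z_{i,d_i,k_i}$ and $\hat b_{k,h}$.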
[0040] The optimal K can be estimated as in equation (13) below. Because K is a natural number, the optimal K is found by evaluating the bracketed expression for K = 1, 2, ... in turn. In addition, a, ...
[0041] [Equation 13]
$$\hat K = \arg\max_{K}\, [\,\ldots\,] \qquad (13)$$
(The bracketed expression survives only as an image in the source text.)
[0042] [Equation 14]
[Equation (14): image not reproduced in the source text]
[0043] Next, a variant is described that introduces an approximation into part of the estimation method above and thereby reduces the memory needed for the computation. The approximation introduced is equation (15) below. It amounts to assuming that the random variables $d_i$ and $k_i$ are independent. There is little theoretical justification for this assumption, but empirically it almost always leads to estimates equivalent to those obtained without it.
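[0044] [Equation 15]
$$P(d_i, k_i \mid x_i, \{b_k\}, K) \approx P(d_i \mid x_i, \{b_k\}, K)\, P(k_i \mid x_i, \{b_k\}, K), \quad \text{i.e.} \quad z_{i,d_i,k_i} \approx a_{i,d_i}\, c_{i,k_i} \qquad (15)$$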
[0045] The estimation algorithm then becomes as follows. First, the value of a is obtained. This can be done by assuming that there is only one origin population. Specifically, the following iterative update equations, (16), are used.
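[0046] [Equation 16]
$$a^{(t)}_{i,d_i} = \frac{\delta_{d_i \in D(x_i)} \prod_{j=1}^{2} y^{(t)}_{d_{i,j}}}{\sum_{d' \in D(x_i)} \prod_{j=1}^{2} y^{(t)}_{d'_j}}, \qquad y^{(t+1)}_{h} = \frac{1}{2I} \sum_{i=1}^{I} \sum_{d_i \in D(x_i)} a^{(t)}_{i,d_i} \sum_{j=1}^{2} \delta_{d_{i,j}=h} \qquad (16)$$
Hereafter, the converged value of $a^{(t)}_{i,d_i}$ is denoted $\hat a_{i,d_i}$.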
[0047] Next, equation (12) is replaced with equation (17) below. Because the huge array z is no longer needed, a large amount of memory is saved.
[0048] [Equation 17]
[Equation (17): image not reproduced in the source text]
Hereafter, the converged values of $c_{i,k}$ and $b_{k,h}$ are denoted $\hat c_{i,k}$ and $\hat b_{k,h}$, respectively.
[0049] The estimation equation corresponding to equation (13) can be replaced with equation (18) below.
[0050] [Equation 18]
[Equation (18): image not reproduced in the source text]
[0051] Equation (14) is no longer needed, because the corresponding estimates have already been obtained by equations (16) and (17).
[0052] A step of finding the optimal number K of origin populations as follows may also be included. That is, the variable update equations can be written as equation (19), and the optimal K as equation (20).
[0053] [Equation 19]
[Equation (19): image not reproduced in the source text]
[0054] [Equation 20]
$$\hat K = \arg\max_{K}\, [\,\ldots\,] \qquad (20)$$
(The maximized expression survives only as an image in the source text.)
[0055] The genome analysis method of this embodiment, described above, is now explained again in more detail and in plainer terms, using a conventional analysis method as an example for comparison. To keep the explanation simple, the subjects sampled are taken to be people.
[0056] A method for determining diplotypes by maximum likelihood estimation with the EM algorithm has long existed for samples drawn from a single population.
This method is applicable under the following two assumptions.
(1) Complete linkage (the genes, bases, etc. under analysis belong to a single haplotype block)
(2) Sampling from a single population (the people surveyed belong to a single population)
[0057] Under these conditions, each haplotype is given an ID number h, as in equation (21) below. There are assumed to be H distinct haplotypes.
[0058] [Equation 21]
$$h:\ \text{haplotype id} \quad (h = 1, \ldots, H) \qquad (21)$$
[0059] Next, the frequency of each haplotype in the population is denoted y, as in equation (22) below.
[0060] [Equation 22]
$$y = [y_1, \ldots, y_H]:\ \text{haplotype frequencies} \qquad (22)$$
[0061] Since y holds the frequency of each haplotype in the population, the elements of y sum to 1, as in equation (23).
[0062] [Equation 23]
$$\sum_{h=1}^{H} y_h = 1 \qquad (23)$$
[0063] The diplotype of the i-th person is denoted $d_i$, as in equation (24).
[0064] [Equation 24]
$$d_i = [d_{i,1}, d_{i,2}]:\ \text{diplotype of the } i\text{-th subject (perfect information)} \qquad (24)$$
[0065] The genotype observed for the i-th person (for example, observed data for several SNPs) is likewise given an ID number and denoted $x_i$, as in equation (25). There are I people in total.
[0066] [Equation 25]
$$x_i:\ \text{genotype id of the } i\text{-th subject} \quad (i = 1, 2, \ldots, I) \qquad (25)$$
[0067] For a given genotype x, the set of diplotypes that could underlie it is denoted D(x), as in equation (26). That is, when the genotype of the i-th person is known to be $x_i$, all diplotypes consistent with it are collected in the set $D(x_i)$.
[0068] [Equation 26]
$$D(x):\ \text{set of possible diplotypes given } x \qquad (26)$$
[0069] An indicator $\delta$ is introduced, as in equation (27) below, to show whether the diplotype $d_i$ of the i-th person is in $D(x_i)$, the set of diplotypes consistent with that person's genotype $x_i$. $\delta$ takes the value 1 if $d_i$ is in $D(x_i)$ and 0 otherwise.
[0070] [Equation 27]
$$\delta_{d_i \in D(x_i)}:\ \text{indicator function; } 1 \text{ if } d_i \in D(x_i),\ 0 \text{ if } d_i \notin D(x_i) \qquad (27)$$
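To make the set D(x) and the indicator of equations (26) and (27) concrete, the following sketch enumerates the diplotypes compatible with a SNP genotype. The encoding is an assumption made for illustration, not taken from the source: a genotype is a tuple of alternative-allele counts (0, 1 or 2) per biallelic SNP, a haplotype is a tuple of 0/1 alleles, and a diplotype is an ordered pair of haplotypes.

```python
from itertools import product


def possible_diplotypes(genotype):
    """Return D(x): all ordered haplotype pairs consistent with a genotype.

    genotype: tuple of alt-allele counts per SNP (0, 1 or 2); hypothetical encoding.
    """
    per_site = []
    for g in genotype:
        if g == 0:
            per_site.append([(0, 0)])            # homozygous reference
        elif g == 2:
            per_site.append([(1, 1)])            # homozygous alternative
        elif g == 1:
            per_site.append([(0, 1), (1, 0)])    # heterozygous: phase unknown
        else:
            raise ValueError("genotype entries must be 0, 1 or 2")
    result = set()
    for combo in product(*per_site):
        h1 = tuple(first for first, _ in combo)
        h2 = tuple(second for _, second in combo)
        result.add((h1, h2))
    return result


def delta(diplotype, genotype):
    """Eq. (27): 1 if the diplotype is in D(x), otherwise 0."""
    return 1 if diplotype in possible_diplotypes(genotype) else 0
```

With this encoding a genotype with m heterozygous sites has 2^m ordered candidate diplotypes, which is why the size of D(x) drives the memory estimates discussed later in the text.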
[0071] Under these conditions, consider the probability of the diplotype of the i-th person. Because the two haplotypes a person carries are treated as independent, the probability of the i-th person's diplotype can be written as a product of haplotype frequencies, as in equation (28) below.
[0072] [Equation 28]
$$P(d_i \mid y) \equiv \prod_{j=1}^{2} y_{d_{i,j}} \qquad (28)$$
[0073] Once the diplotype of the i-th person is specified, that person's genotype is uniquely determined, as shown in equation (29) below.
[0074] [Equation 29]
$$P(x_i \mid d_i) \equiv \delta_{d_i \in D(x_i)} \qquad (29)$$
[0075] From equations (28) and (29), the probability of the genotype of the i-th person can be expressed as equation (30) below.
[0076] [Equation 30]
$$P(x_i, d_i \mid y) = P(x_i \mid d_i)\, P(d_i \mid y) = \delta_{d_i \in D(x_i)} \prod_{j=1}^{2} y_{d_{i,j}} \qquad (30)$$
[0077] As it stands, however, this is hard to compute, so the EM algorithm is introduced. In equation (30), $x_i$ is an observable variable and $d_i$ is an unobservable variable, and the goal is to find the correct value of the model parameter y. In other words, the best y under equation (30) is sought by maximum likelihood estimation. In the genome analysis method of this embodiment, therefore, the EM algorithm is introduced to find the best (maximum-likelihood) y as in equation (31) below: the natural logarithm of the probability in equation (30) is taken, its average is considered, and the resulting expression is solved.
[0078] [Equation 31]
$$\hat y \equiv \arg\max_{y} \left\langle \ln P(\{x_i\},\{d_i\} \mid y) \right\rangle_{P(\{d_i\} \mid \{x_i\},\, y)} = \arg\max_{y} Q(\{x_i\} \mid y) \qquad (31)$$
[0079] In the EM algorithm, equation (31) is made to converge toward the true value by repeated substitution. Mathematically this can be written as equation (32) below: starting from an initial value $y^{(0)}$, $y^{(t)}$ is recomputed repeatedly until it converges.
[0080] [Equation 32]
$$y^{(t+1)} = \arg\max_{y} \left\langle \ln P(\{x_i\},\{d_i\} \mid y) \right\rangle_{P(\{d_i\} \mid \{x_i\},\, y^{(t)})} \qquad (32)$$
[0081] The unobservable $d_i$ can then be filled in by the EM algorithm. From the EM algorithm, a, the probability that the i-th person has diplotype $d_i$, is obtained as shown in equation (33) below. This means precisely that the diplotype of each individual can be determined.
[0082] [Equation 33]
$$\text{E-step:} \quad a^{(t)}_{i,d_i} \equiv P(d_i \mid x_i, y^{(t)}) = \frac{\delta_{d_i \in D(x_i)} \prod_{j=1}^{2} y^{(t)}_{d_{i,j}}}{\sum_{d' \in D(x_i)} \prod_{j=1}^{2} y^{(t)}_{d'_j}} \qquad (33)$$
[0083] Similarly, the update that makes y converge can be written as equation (34) below.
[0084] [Equation 34]
$$\text{M-step:} \quad y^{(t+1)}_{h} = \frac{1}{2I} \sum_{i=1}^{I} \sum_{d_i \in D(x_i)} a^{(t)}_{i,d_i} \sum_{j=1}^{2} \delta_{d_{i,j}=h} \qquad (34)$$
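The E-step and M-step of equations (33) and (34) can be sketched as follows. This is a minimal illustration of the standard EM iteration for haplotype frequencies in a single population, not the inventors' implementation; the function name, the uniform initial value, the tolerance and the representation of each D(x_i) as a list of ordered haplotype-index pairs are all assumptions.

```python
def em_single_population(possible, H, iters=100, tol=1e-9):
    """Estimate haplotype frequencies y and diplotype posteriors a.

    possible: for each subject, a list of candidate diplotypes (h1, h2),
              i.e. the set D(x_i), with haplotype IDs in range(H).
    """
    I = len(possible)
    y = [1.0 / H] * H          # assumed initial value: uniform frequencies
    a = []
    for _ in range(iters):
        # E-step: posterior of each candidate diplotype given the current y
        a = []
        for D in possible:
            w = [y[h1] * y[h2] for h1, h2 in D]
            s = sum(w)
            a.append([wi / s for wi in w])
        # M-step: expected haplotype counts divided by 2I chromosomes
        new_y = [0.0] * H
        for D, ai in zip(possible, a):
            for (h1, h2), p in zip(D, ai):
                new_y[h1] += p
                new_y[h2] += p
        new_y = [v / (2 * I) for v in new_y]
        change = max(abs(u - v) for u, v in zip(new_y, y))
        y = new_y
        if change < tol:
            break
    # a holds the posteriors from the last E-step, computed before the final y update
    return y, a
```

For example, with four haplotypes, two unambiguous homozygotes and one double heterozygote, the iteration resolves the ambiguous subject toward the phase that reuses the haplotypes already seen in the homozygotes.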
[0085] Thus an estimation method for a single population already existed. The present invention applies that method to provide a diplotype determination method for samples drawn from multiple populations. This method is now described in more detail and in plainer terms.
[0086] As with the single-population estimation model above, the following preconditions are assumed for the diplotype determination method of the present invention.
(1) Complete linkage (the genes, bases, etc. under analysis belong to a single haplotype block)
(2) Each individual surveyed belongs to exactly one of the multiple populations
[0087] Parameters and variables are defined as in the single-population estimation model above. First, K denotes the number of populations. The vector $b_k$ holds, for the k-th population, the frequency of the haplotype with each ID h. Summing $b_{k,h}$ over all haplotypes h therefore gives 1 for each population k. $k_i$ is the ID of the population to which the i-th person belongs. The equations for these parameters and variables are listed in equation (35).
[0088] [Equation 35]
$$\begin{aligned} &K:\ \text{number of genetic populations} \\ &b_k = [b_{k,1}, \ldots, b_{k,H}]:\ \text{haplotype frequencies in the } k\text{-th group}, \quad \sum_{h=1}^{H} b_{k,h} = 1 \\ &k_i:\ i\text{-th subject's population id} \end{aligned} \qquad (35)$$
[0089] Here every person is assumed to be equally likely to belong to any of the populations. This is expressed by equation (36) below.
[0090] [Equation 36]
$$P(k_i \mid K) \equiv \frac{1}{K} \qquad (36)$$
[0091] The probability of the diplotype $d_i$ of the i-th person can then be expressed in terms of $k_i$ and the haplotype frequencies of population $k_i$. This is equation (37) below (compare equation (28) of the single-population estimation model).
[0092] [Equation 37]
$$P(d_i \mid k_i, \{b_k\}) \equiv \prod_{j=1}^{2} b_{k_i,\,d_{i,j}} \qquad (37)$$
[0093] Extending equation (30) of the single-population estimation model, the probability of the genotype of the i-th person can be expressed as equation (38) below.
[0094] [Equation 38]
$$P(x_i, d_i, k_i \mid \{b_k\}, K) = P(x_i \mid d_i)\, P(d_i \mid k_i, \{b_k\})\, P(k_i \mid K) = \delta_{d_i \in D(x_i)}\, \frac{1}{K} \prod_{j=1}^{2} b_{k_i,\,d_{i,j}} \qquad (38)$$
Here $x_i$ is the observed variable.
[0095] What is to be found by maximum likelihood estimation is the best set of vectors $b_k$ and the best number of populations K. This is shown in equation (39) below.
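[0096] [Equation 39]
$$\{\{\hat b_k\}, \hat K\} \equiv \arg\max_{\{b_k\},\,K} \left\langle \ln P(\{x_i\},\{d_i\},\{k_i\} \mid \{b_k\}, K) \right\rangle_{P(\{d_i\},\{k_i\} \mid \{x_i\},\{b_k\},K)} \qquad (39)$$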
[0097] The unobservable $d_i$ and $k_i$ can then be filled in by the EM algorithm. From the EM algorithm, z, the probability that the i-th person has diplotype $d_i$ and belongs to population $k_i$, is obtained as shown in equation (40) below. This means precisely that each individual's diplotype, the population each individual belongs to, and the number of populations can be determined. The memory required is on the order of O(IDK).
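[0098] [Equation 40]
$$\text{E-step:} \quad z^{(t)}_{i,d_i,k_i} \equiv P(d_i, k_i \mid x_i, \{b^{(t)}_k\}) = \frac{\delta_{d_i \in D(x_i)} \prod_{j=1}^{2} b^{(t)}_{k_i,\,d_{i,j}}}{\sum_{k'=1}^{K} \sum_{d' \in D(x_i)} \prod_{j=1}^{2} b^{(t)}_{k',\,d'_j}} \qquad (40)$$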
[0099] Similarly, the update that makes $b_k$ converge can be written as equation (41) below.
[0100] [Equation 41]
$$\text{M-step:} \quad b^{(t+1)}_{k,h} = \frac{\sum_{i=1}^{I} \sum_{d_i \in D(x_i)} z^{(t)}_{i,d_i,k} \sum_{j=1}^{2} \delta_{d_{i,j}=h}}{\sum_{h'=1}^{H} \sum_{i=1}^{I} \sum_{d_i \in D(x_i)} z^{(t)}_{i,d_i,k} \sum_{j=1}^{2} \delta_{d_{i,j}=h'}} \qquad (41)$$
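The joint update of equations (40) and (41) can be sketched as below. This is an illustrative reading of the text, not the inventors' code: the function name, the random initial frequencies (used so that the K populations do not start out identical), the fixed iteration count and the list-of-lists layout of z are assumptions. The nested list `z[i][n][k]` is exactly the O(IDK) array the text refers to.

```python
import random


def em_multi_population(possible, H, K, iters=100, seed=0):
    """Alternate the E-step (40) and M-step (41) for K origin populations.

    possible: for each subject, a list of candidate diplotypes (h1, h2).
    Returns b (K x H frequencies) and z[i][n][k], the posterior that subject i
    has its n-th candidate diplotype and belongs to population k.
    """
    rng = random.Random(seed)
    b = []
    for _ in range(K):
        row = [rng.random() + 0.1 for _ in range(H)]   # assumed random start
        total = sum(row)
        b.append([v / total for v in row])
    z = []
    for _ in range(iters):
        # E-step: normalize jointly over candidate diplotypes and populations;
        # the uniform prior 1/K cancels in the ratio
        z = []
        for D in possible:
            w = [[b[k][h1] * b[k][h2] for k in range(K)] for h1, h2 in D]
            s = sum(map(sum, w))
            z.append([[v / s for v in row] for row in w])
        # M-step: expected haplotype counts per population, then normalize
        counts = [[0.0] * H for _ in range(K)]
        for D, zi in zip(possible, z):
            for (h1, h2), zd in zip(D, zi):
                for k in range(K):
                    counts[k][h1] += zd[k]
                    counts[k][h2] += zd[k]
        b = [[v / sum(row) for v in row] for row in counts]
    return b, z
```

The sketch omits safeguards a production version would need, such as handling a population whose expected count underflows to zero, and running several random restarts as the text recommends.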
[0101] As it stands, however, the computation needs a large amount of memory. The inventors found that introducing an approximation reduces the order of the memory needed from O(IDK) to O(ID). The approximation is described in detail below. The number of heterotype kinds that can be handled is limited to roughly 30. Because the number of iterations the EM algorithm needs to converge generally depends strongly on the initial values, the inventors run multiple trials when carrying out the genome analysis method of the present invention. The number of origin populations K can in principle be as large as K = I, but it is known empirically that restricting it to the range K = 1 to √I tends to give more reasonable results, and this can be used when setting the initial values.
[0102] The approximation, found by the inventors, that reduces the memory needed for the computation is now described in detail. When the genotype of the i-th person and the haplotype frequencies in population $k_i$ are known, the probability that the i-th person has diplotype $d_i$ and belongs to population $k_i$ can be rewritten, by treating the two as independent, as the product of the probability that the i-th person has diplotype $d_i$ and the probability that the i-th person belongs to population $k_i$. This is expressed by equation (42) below.
[0103] [Equation 42]
$$P(d_i, k_i \mid x_i, \{b_k\}) \approx P(d_i \mid x_i, \{b_k\})\, P(k_i \mid x_i, \{b_k\}) \equiv a_{i,d_i}\, c_{i,k_i} \qquad (42)$$
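The effect of the factorization in equation (42) on storage can be illustrated with a short sketch. It assumes, for illustration only, that one subject's joint posterior is held as a nested list `z_i[d][k]`; the helper names are hypothetical. The first function computes the two marginals that replace the joint array, and the second counts stored entries for the full and factored layouts.

```python
def marginals(z_i):
    """Split one subject's joint posterior z_i[d][k] into a_i[d] and c_i[k]."""
    a_i = [sum(row) for row in z_i]          # probability of each diplotype
    c_i = [sum(col) for col in zip(*z_i)]    # probability of each population
    return a_i, c_i


def storage_entries(I, D, K):
    """Entries stored for I subjects, D candidate diplotypes, K populations."""
    return {"full_z": I * D * K, "factored": I * D + I * K}
```

When the joint posterior really is a product of its marginals, `a_i[d] * c_i[k]` reproduces `z_i[d][k]` exactly; otherwise the product is only the approximation the text describes. With 1000 subjects, 64 candidate diplotypes and 10 populations, the full array has 640000 entries against 74000 for the two factors.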
[0104] The a appearing in equation (42) is the same as the a of the single-population maximum likelihood model described above. Using it together with the EM algorithm, c, the probability of which population the i-th person came from, can be written as equation (43) below. The memory required is on the order of O(IK).
[0105] [Equation 43]
E-step: [Equation (43): image not reproduced in the source text]
[0106] Similarly, the update that makes $b_k$ converge can be written as equation (44) below. The memory required for this is on the order of O(KH).
[0107] [Equation 44]
M-step: [Equation (44): not recoverable from the source text]
[0108] As a result, the inventors found that the iterative substitution of equation (44) converges after roughly 50 to 100 iterations, making it possible to determine each individual's diplotype, the population each individual belongs to, and the number of origin populations.
[0109] Next, the genome analysis steps performed by the genome analysis system 1 are described.
First, as shown in Fig. 4, the genetic polymorphisms to be investigated are determined (step S1, determination step). Then the allele information of those polymorphisms in the group under study is determined by a wet process (step S2, wet process step). The wet process is the step in which genomic information such as the genetic polymorphisms of the samples is determined using a DNA sequencer or similar equipment. Each individual's haplotypes are then determined or estimated from the allele information (step S3, haplotype estimation step).
[0110] Next, two loosely related feature parameters that represent the group are chosen (step S4, feature parameter determination step). Here, the degree to which each sample belongs to each origin population and the haplotype frequencies of each origin population are taken as the two feature parameters. An update operator between the two feature parameters is then constructed from the genetic information and a third parameter (step S5, update equation construction step). The third parameter here is each individual's diplotypes and their frequencies. Embedding genetic (statistical) knowledge, that is, genetic information, in the update operator means precisely this: adopting each individual's diplotypes and their frequencies, which constitute genetic (statistical) knowledge, as the third parameter.
[0111] Starting from suitable initial values, the two feature parameters are then computed in turn by the update operator (step S6, feature parameter derivation step), and the transformation is repeated until the parameters converge (step S7, transformation convergence step). The two feature parameters are thereby obtained (step S8). Updating the feature parameters with the update equations means using the update operator to compute the two feature parameters in turn, deriving each from the other alternately. Making the parameters converge through these updates is what is meant by making the state variables converge to the values they should inherently have, that is, bringing them close to their true values.
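Steps S6 to S8 amount to a generic alternating-update loop with a convergence test. The following sketch shows that control flow only; the two update callables, the tolerance and the iteration cap are hypothetical stand-ins for the update operators and stopping rule, which the text describes only in outline.

```python
def alternate_until_converged(first, update_second, update_first,
                              tol=1e-8, max_iters=1000):
    """Derive each of two parameter vectors from the other until they settle.

    first:         initial value of the first feature parameter (list of floats)
    update_second: callable mapping the first parameter to the second
    update_first:  callable mapping the second parameter to the first
    Returns (first, second, number_of_iterations_used).
    """
    second = update_second(first)
    for step in range(1, max_iters + 1):
        new_first = update_first(second)          # S6: first from second
        new_second = update_second(new_first)     # S6: second from first
        change = max(abs(p - q) for p, q in zip(new_first, first))
        first, second = new_first, new_second
        if change < tol:                          # S7: stop once converged
            return first, second, step
    return first, second, max_iters               # S8: final parameter values
```

In the method described here, the two callables would correspond to the update of the population membership degrees and the update of the haplotype frequencies; the toy maps `x -> x / 2` and `s -> s + 1`, for example, settle at the fixed point 2 and 1.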
Example 1
[0112] Next, an example is described.
Figs. 5 to 9 show an example of analysis results obtained by the update-operator genome analysis method, which uses multilocus genotype data and haplotype data to infer the origin populations and to assign each sample to an origin population.
[0113] In genetic analysis, case-control association analysis has become a powerful way of mapping genotype data to phenotype data (for example, association mapping to locate disease genes). In case-control association analysis, however, genotype data from a structured population can introduce mapping errors and lead to spurious positive results.
[0114] It is therefore desirable to detect latent population structure before case-control association analysis. Methods for detecting latent population structure include those that identify structured populations from the alleles at each locus, such as the MCMC method based on Bayesian statistics and cluster models based on a notion of distance between samples. This example instead adopts a new modeling method based on a fast grouping algorithm.
[0115] The fast grouping algorithm is the analysis method of the present invention. Here haplotypes are regarded as more informative genetic data than alleles, so haplotypes rather than alleles are used as the genetic information for the analysis.
[0116] Where two variables that represent the features of the population to which the sample data belong are loosely related through a third variable, the method adopted estimates these three variables by means of a fourth variable that can be observed.
[0117] In this example, as described above, the haplotype frequencies $b_k$ of the origin populations and the degree $c_{i,k}$ to which each sample belongs to each origin population are adopted as the two state variables that characterize the group. The features of the population to which the sampled individuals belong are estimated from these. Also as described above, each individual's diplotypes and their frequencies $a_{i,d_i}$ are adopted as the third variable linking the two state variables, and the observed data, namely the genotype information $x_i$, as the fourth variable.
[0118] First, $a_{i,d_i}$ is obtained from the observed data $x_i$. Specifically, y is given a suitable initial value, and equations (45) and (46) are computed in turn repeatedly, about 100 times, until the values converge.
[0119] [Equation 45]
[Equations (45) and (46): images not reproduced in the source text]
Hereafter, the converged value of $a_{i,d_i}$ is denoted $\hat a_{i,d_i}$.
[0120] Next, the number of origin populations K is assumed to be 1.
[0121] Next, c is given a suitable initial value, and equations (47) and (48) are computed repeatedly, about 100 times, until the values converge. The first and second state variables are thereby obtained via the third variable.
[0122] [Equation 46]
[Equations (47) and (48): images not reproduced in the source text]
Hereafter, the converged values of $c_{i,k}$ and $b_{k,h}$ are denoted $\hat c_{i,k}$ and $\hat b_{k,h}$, respectively.
[0123] Equations (49) to (52) may be used in place of equations (47) and (48).
[0124] [Equation 47]
[Equations (49) to (52): images not reproduced in the source text]
[0125] Using the values obtained so far, the value of equation (53) below is computed and recorded.
[0126] [Equation 48]
[Equation (53): not fully recoverable from the source text; the legible fragment is a multiple sum followed by the term $- I \ln K$]
[0127] Equation (54) may be used in place of equation (53) (see [Equation 49] below).
[0128] [Equation 49]
[Equation (54): image not reproduced in the source text]
[0129] Similarly, the calculations of (47) to (54) are repeated with the number of origin populations set to 2, 3, 4, and so on. This is continued up to the largest natural number that does not exceed √I.
[0130] Finally, the number of origin populations for which equation (53) or equation (54) takes its largest value is adopted as the optimal number. The values of the variables obtained for that number are adopted as the optimal values.
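The search over the number of origin populations described in [0120] to [0130] can be sketched as a simple loop. Because equations (53) and (54) are not legible in the source text, the scoring is left to a caller-supplied function: `fit` is a hypothetical callable that runs the updates for a given K and returns the recorded score together with the fitted parameters.

```python
from math import isqrt


def select_num_populations(num_subjects, fit):
    """Try K = 1, 2, ... up to floor(sqrt(I)) and keep the best-scoring K.

    fit: hypothetical callable, fit(K) -> (score, params), standing in for
         the iterative updates and the score of equation (53) or (54).
    Returns (best_K, best_params, best_score).
    """
    if num_subjects < 1:
        raise ValueError("at least one subject is required")
    best = None
    for K in range(1, isqrt(num_subjects) + 1):
        score, params = fit(K)
        if best is None or score > best[0]:
            best = (score, K, params)   # keep the first K reaching the top score
    return best[1], best[2], best[0]
```

With 30 subjects, for example, the loop evaluates K = 1 to 5, since 5 is the largest natural number not exceeding √30.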
[0131] Next, the data from the structure analysis are described.
[0132] Fig. 5 shows the difference in execution time between this example of the structure analysis program and the MCMC method. As Fig. 5 shows, the method of the present invention produces its results far faster than the conventional method.
[0133] Fig. 6 shows the haplotype frequencies of two origin populations as estimated in this example.
[0134] Fig. 7 shows the degree $c_{i,k}$ to which each sample belongs to each origin population, as estimated in this example.
[0135] Fig. 8 compares the estimation accuracy of this example with that of the MCMC method and the cluster method on various data sets. The method of the present invention estimates more accurately than the conventional methods.
[0136] Fig. 9 shows an example of the number of origin populations estimated in this example.
産業上の利用可能性  Industrial applicability
[0137] 以上の如く本発明によれば、サンプルデータにより母集団の特徴を推定するための 解析を、従来よりも高速に、且つより多くのサンプルについての解析を行うことができ る。 [0137] As described above, according to the present invention, analysis for estimating the characteristics of a population from sample data can be performed at a higher speed and with respect to more samples.
Brief Description of Drawings
[0138] [FIG. 1] An explanatory diagram outlining the genome analysis system used in the genome analysis method of the present invention.
[FIG. 2] A block diagram of the genome analysis system of the present invention.
[FIG. 3] A diagram outlining the analysis performed by the genome analysis system of FIG. 1.
[FIG. 4] A flowchart showing the genome analysis method of the present invention.
[FIG. 5] A comparison of the execution times of the genome analysis method of the present invention and the MCMC method.
[FIG. 6] The estimated haplotype frequencies of the origin populations.
[FIG. 7] The degree of membership of each sample in the origin populations.
[FIG. 8] A comparison of the origin population estimates of the present invention, the MCMC method, and the cluster method.
[FIG. 9] The estimated number of origin populations.
Explanation of Symbols
[0139] 1 Genome analysis system

Claims

[1] A genome analysis system into which sample data is taken, characterized by comprising feature estimation means by which the characteristics of the population to which the sample data belongs and/or the position of each specimen of the sample data within the population are estimated by means of:
selection means by which two state variables, a first state variable and a second state variable, are selected, each being a state variable that characterizes the population or a state variable that represents the position of each specimen within the population; and
convergence means for converging the first state variable and the second state variable to the values they should properly take.
[2] A genome analysis system comprising capture means for taking in sample data, and computation means, characterized in that the computation means:
selects two state variables, a first state variable and a second state variable, each being a state variable that characterizes the population to which the sample data taken in by the capture means belongs or a state variable that represents the position of each specimen of the sample data within the population;
converges the first state variable and the second state variable to the values they should properly take; and
estimates the characteristics of the population and/or the position of each specimen within the population.
[3] The genome analysis system according to claim 1 or 2, further comprising: conversion means by which the first state variable and the second state variable are converted into each other, using as operators update equations that embed knowledge of (statistical) genetics and express each variable in terms of the other; and estimation means by which the first state variable and the second state variable are each estimated by way of a third state variable embedded in the update equation applicable to that variable.
[4] The genome analysis system according to any one of claims 1 to 3, wherein the first state variable is the degree of membership of each sample of the sample data in each origin population, and the second state variable is the haplotype frequency of each origin population of the sample data.
[5] The genome analysis system according to any one of claims 1 to 4, wherein the third state variable is the diplotype of each sample of the sample data and its frequency.
[6] The genome analysis system according to any one of claims 1 to 5, wherein a first state variable update equation, which is the update equation applicable to the first state variable, is expressed by the following equation (1).
[Math. 1]
Equation (1): see image imgf000031_0001 of the published application.
[7] The genome analysis system according to any one of claims 1 to 6, wherein a second state variable update equation, which is the update equation applicable to the second state variable, is expressed by the following equation (2).
[Math. 2]
Equation (2): see image imgf000031_0002 of the published application.
[8] The genome analysis system according to any one of claims 1 to 7, wherein a second state variable update equation, which is the update equation applicable to the second state variable, is expressed by the following equation (3).
[Math. 3]
Equation (3): see image imgf000031_0003 of the published application.
[9] The genome analysis system according to any one of claims 1 to 8, further comprising K optimal-solution derivation means for obtaining an optimal solution for the number K of origin populations according to the following equation (4).
[Math. 4]
K = arg max_K [ Σ … ln … − I ln K ]   (4)
[10] The genome analysis system according to any one of claims 1 to 9, further comprising K optimal-solution derivation means for obtaining an optimal solution for the number K of origin populations according to the following equation (5).
[Math. 5]
K = arg max_K [ … ]   (5)
[11] The genome analysis system according to any one of claims 1 to 10, wherein the update equation that updates the first state variable and the second state variable is expressed by the following equation (6).
[Math. 6]
Equation (6): see image imgf000032_0001 of the published application.
[12] The genome analysis system according to any one of claims 1 to 11, comprising:
determination means by which the genetic polymorphisms to be investigated are determined;
wet-process means by which each individual's haplotypes are determined or estimated from allele information determined by a wet process for the genetic polymorphisms of the group to be investigated;
feature parameter determination means by which two feature parameters are determined, each being a feature parameter that characterizes the group and/or a feature parameter that represents the position of the group within the population;
update equation construction means by which update equations between the two feature parameters are constructed from genetic information;
feature parameter derivation means by which, starting from predetermined initial values, the two feature parameters are obtained in turn using the update equations; and
conversion convergence means for repeating the conversion until the two feature parameters converge,
wherein, once the two feature parameters have been obtained, the characteristics of the population and/or the position of each specimen within the population are estimated from the sample data.
[13] A genome analysis method characterized by comprising:
a capture step of taking in sample data; and
a feature estimation step of estimating the characteristics of the population to which the sample data belongs and/or the position of each specimen within the population, by selecting two state variables, a first state variable and a second state variable, each being a state variable that characterizes the population and/or a state variable that represents the position of each specimen within the population, and by a convergence step of converging the first state variable and the second state variable to the values they should properly take.
[14] The genome analysis method according to claim 13, further comprising: a conversion step of converting the first state variable and the second state variable into each other, using as operators update equations that embed knowledge of (statistical) genetics and express each variable in terms of the other; and an estimation step of estimating each of the first state variable and the second state variable by way of a third state variable embedded in the update equation applicable to that variable.
[15] The genome analysis method according to claim 13 or 14, wherein the first state variable is the degree of membership of each sample of the sample data in each origin population, and the second state variable is the haplotype frequency of each origin population of the sample data.
[16] The genome analysis method according to claim 14 or 15, wherein the third state variable is the diplotype of each sample of the sample data and its frequency.
[17] The genome analysis method according to any one of claims 13 to 16, wherein a first state variable update equation, which is the update equation applicable to the first state variable, is expressed by the following equation (1).
[Math. 1]
Equation (1): see image imgf000033_0001 of the published application.
[18] The genome analysis method according to any one of claims 13 to 17, wherein a second state variable update equation, which is the update equation applicable to the second state variable, is expressed by the following equation (2).
[Math. 2]
Equation (2): not legible in the extracted text of the published application.
[19] The genome analysis method according to any one of claims 13 to 18, wherein a second state variable update equation, which is the update equation applicable to the second state variable, is expressed by the following equation (3).
[Math. 3]
Equation (3): see image imgf000034_0001 of the published application.
[20] The genome analysis method according to any one of claims 13 to 19, further comprising a K optimal-solution derivation step of obtaining an optimal solution for the number K of origin populations according to the following equation (4).
[Math. 4]
K = arg max_K [ Σ … ln … − I ln K ]   (4)
[21] The genome analysis method according to any one of claims 13 to 20, further comprising a K optimal-solution derivation step of obtaining an optimal solution for the number K of origin populations according to the following equation (5).
[Math. 5]
Equation (5): see image imgf000034_0002 of the published application.
[22] The genome analysis method according to any one of claims 13 to 21, wherein the update equation that updates the first state variable and the second state variable is expressed by the following equation (6).
[Math. 6]
Equation (6): see image imgf000034_0003 of the published application.
[23] The genome analysis method according to any one of claims 13 to 22, comprising:
a determination step of determining the genetic polymorphisms to be investigated;
a wet-process step of determining, by a wet process, allele information for the genetic polymorphisms of the group to be investigated;
a haplotype estimation step of determining or estimating each individual's haplotypes from the allele information;
a feature parameter determination step of determining two feature parameters that characterize the group;
an update equation construction step of constructing update equations between the two feature parameters from genetic information;
a feature parameter derivation step of obtaining, starting from predetermined initial values, the two feature parameters in turn using the update equations; and
a conversion convergence step of repeating the conversion until the two feature parameters converge,
wherein, once the two feature parameters have been obtained, the characteristics of the population and/or the position of each specimen within the population are estimated from the sample data.
[24] A program capable of executing the genome analysis method according to any one of claims 13 to 23.
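The alternating scheme recited in claims 12 and 23 (start from initial values, obtain the two parameters in turn, repeat until they converge) can be sketched as follows. This is only an illustration under stated assumptions. Equations (1) to (3) and (6) are not legible in this text, so the two update rules below are those of a standard mixture model fitted in an EM-like manner, used as a stand-in rather than the equations of the application. Diplotypes are assumed to be already phased into pairs of haplotype indices, and the names `alternate_updates`, `c`, and `f` are hypothetical.

```python
import random
from typing import List, Tuple


def alternate_updates(
    diplotypes: List[Tuple[int, int]],
    num_haplotypes: int,
    k: int,
    max_iter: int = 200,
    tol: float = 1e-8,
    seed: int = 0,
) -> Tuple[List[List[float]], List[List[float]]]:
    """Alternate between membership degrees c[i][pop] and haplotype
    frequencies f[pop][h] until the memberships stop changing."""
    rng = random.Random(seed)

    # Initial values: random membership degrees, each row normalised to 1.
    c = []
    for _ in diplotypes:
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        c.append([v / total for v in row])

    f: List[List[float]] = []
    for _ in range(max_iter):
        # Stand-in update for the haplotype frequencies:
        # membership-weighted haplotype counts, normalised per population.
        f = []
        for pop in range(k):
            counts = [1e-6] * num_haplotypes  # small pseudo-count avoids zeros
            for i, (h1, h2) in enumerate(diplotypes):
                counts[h1] += c[i][pop]
                counts[h2] += c[i][pop]
            total = sum(counts)
            f.append([v / total for v in counts])

        # Stand-in update for the membership degrees: proportional to the
        # probability of the sample's two haplotypes in each population.
        new_c = []
        for h1, h2 in diplotypes:
            weights = [f[pop][h1] * f[pop][h2] for pop in range(k)]
            total = sum(weights)
            new_c.append([w / total for w in weights])

        # Convergence check on the largest change in any membership degree.
        delta = max(
            abs(a - b) for new_row, old_row in zip(new_c, c)
            for a, b in zip(new_row, old_row)
        )
        c = new_c
        if delta < tol:
            break
    return c, f
```

Each pass recomputes one parameter from the current value of the other, which is the pattern the claims describe. The convergence test on the membership degrees and the small pseudo-count are design choices of this sketch, not features taken from the application.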

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2006/313757 WO2008007424A1 (en) 2006-07-11 2006-07-11 Genome analysis system, genome analysis method, and program

Publications (1)

Publication Number Publication Date
WO2008007424A1 true WO2008007424A1 (en) 2008-01-17

Family

ID=38922995

Country Status (1)

Country Link
WO (1) WO2008007424A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005276022A (en) * 2004-03-26 2005-10-06 Hitachi Ltd Diagnosis support system and diagnosis support method
WO2006027835A2 (en) * 2004-09-08 2006-03-16 Genesys Technologies Inc Genome analysis method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ITO T.: "Association test algorithm between a qualitative phenotype and a haplotype or haplotype set using simultaneous estimation of haplotype frequencies, diplotype configurations and diplotype-based penetrances", GENETICS, vol. 168, no. 4, 2004, pages 2339 - 2348, XP002990146 *
SHIMOSATO J., KOMAI M., KATTO J.: "A Proposal for Haplotype Estimation of Many SNPs Inputs Using Block Division", THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS, vol. 103, no. 150, 2003, pages 17 - 22, XP003002936 *
TANAKA J. ET AL.: "An unsupervised diplotype clustering method to improve race-based medicine", 2005, XP003002935, Retrieved from the Internet <URL:http://www.jsbi.org/journal/GIW05/GIW05P101.pdf> *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06768070

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

122 Ep: pct application non-entry in european phase

Ref document number: 06768070

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP