WO2008007424A1

WO2008007424A1 - Genome analysis system, genome analysis method, and program

Info

Publication number: WO2008007424A1
Application number: PCT/JP2006/313757
Authority: WO
Inventors: Junji Tanaka; Masato Inoue
Original assignee: Digital Information Technologies Corporation
Priority date: 2006-07-11
Filing date: 2006-07-11
Publication date: 2008-01-17

Abstract

An analysis system and method for estimating the features of a population from sampled data. Sample data is captured. Knowledge of genetics (statistics) is embedded in update expressions for a first and a second state variable that characterize a group. By repeatedly applying the update expressions, the first and second state variables are made to converge to their proper values. In this way, the feature parameters of the population to which the sample data belongs and/or the feature parameters representing the position of each sample within the population are estimated, and the estimated features of the population and/or the position of each sample within the population can be output.

Description

Specification

Genome analysis system, genome analysis method and program

Technical field

[0001] The present invention relates to a genome analysis system, an analysis method, and a program for estimating, in particular from sample data, the characteristics of a population and/or the position of each specimen in the population.

Background art

[0002] Almost all living organisms on Earth are composed of cells, and each cell has a genome that records genetic information. Cells are classified into prokaryotic cells and eukaryotic cells according to their structure. Genomes in prokaryotic cells such as bacteria and cyanobacteria exist without being partitioned off within the cell, whereas genomes in eukaryotic cells such as those of animals and plants are enclosed by a nuclear membrane.

[0003] In other words, the genome refers to the set of chromosomes that is indispensable for carrying out life activities. The word "genome" is a compound of "gene" and "chromosome".

[0004] The basic unit of life is the cell. The cell is enclosed by a cell membrane and the nucleus by a nuclear membrane, so that each unit remains independent. The human body is made up of specialized cell groups with differentiated functions and forms, such as nerve cells, muscle cells, blood cells, immune system cells, epithelial cells (the cells on the surface of skin and tissues), and sensory cells, together with undifferentiated cells called stem cells. Cells also have an important time-varying aspect: they produce new cells by dividing. Cell division is an important mechanism that enables the transmission and expression of genetic information.

[0005] Chromosomes are located in the nucleus. They carry the genetic information, and the genes are contained in them. Genes are the parts of the genome that mainly define how proteins are made. The basic substance that makes up a chromosome is DNA (deoxyribonucleic acid), and genetic information is stored as a sequence of the four bases A, T, G, and C in DNA. Haploid organisms such as bacteria and viruses have a single genome.

[0006] A diploid organism has two sets of genomes with overlapping genetic information. For example, germ cells such as human eggs and sperm have one set of the genome, consisting of 23 chromosomes, whereas somatic cells have two sets (46 chromosomes). The human genome consists of about 3 billion DNA base pairs (3000 megabase pairs; 1 megabase is 1 million base pairs), and stretched out as a single strand it is about 1 meter long.

[0007] A genome is the totality of the genetic information present in a cell, and includes the genes and the information that controls gene expression. Proteins and genes are related as products and blueprints, and in addition to the blueprints the genome contains regions that control the production of the products. There are also regions whose significance is currently unknown but which appear to have some influence on the maintenance of biological functions. Clarifying these is expected to allow a more accurate understanding of life phenomena.

[0008] For this reason, in addition to the "human genome analysis project", which analyzes the entire sequence of the human genome, projects to determine the nucleotide sequence of the genome are under way for various organisms. A deeper understanding of life phenomena is expected from combining genome research with gene and protein research.

[0009] First of all, the network between genes must be understood. That is, multiple proteins form a network, and through it these proteins exhibit specific functions. Therefore, by studying what functions are performed and what information is exchanged, it may be possible to find genes whose functions are unknown.

[0010] Genome analysis is the comprehensive analysis of the genetic information in an organism's genome, and it begins with determining the base sequence (the order of G, A, T, and C) of the DNA molecules that make up the genome. However, it is not easy to determine where and what genes exist from base sequence data alone. Therefore, analysis is carried out by combining analysis of gene products such as messenger RNA and proteins produced by transcription and translation, comparison of how similar base sequences are between species, and data on individual genes in experimental organisms such as E. coli and budding yeast.

[0011] In the case of humans, the human genome is the nucleotide sequence of about 3 billion base pairs of DNA contained in a total of 24 kinds of chromosomes (that is, DNA molecules): 22 autosomes, the X chromosome, and the Y chromosome. The genome information we have is inherited from our parents, and our parents' genome information was in turn inherited from the generation before them. By tracing the origin of genetic information back one generation at a time in this way, we can reach the genome of the first organism of 3.8 billion years ago.

[0012] Patent Document 1 proposes a genome analysis method in which genome sequence information is input; if the input genome sequence information contains a sequence portion in which a plurality (for example, 10 or more) of identical bases are arranged consecutively, base sequence information consisting of that sequence portion and a predetermined number of bases arranged consecutively before and after it is extracted; and the extracted base sequence information is output.

[0013] By using such a genome analysis method, a polymorphic marker for identifying a disease-related candidate gene can reportedly be found quickly and efficiently, with an accuracy close to that of SNPs (single nucleotide polymorphisms), without using SNPs.

[0014] What Patent Document 1 describes is a genome analysis method that attempts to find polymorphic markers for identifying disease-related candidate genes. However, genome analysis requires examining the DNA base sequence from various viewpoints. Much has therefore not yet been elucidated, and various other methods of genome analysis are expected to contribute to that elucidation.

Patent Document 1: Japanese Patent Laid-Open No. 2003-288346

Disclosure of the invention

Problems to be solved by the invention

[0015] With these conventional genome analysis methods, diplotype determination by maximum likelihood estimation existed for samples drawn from a single population, but it was difficult to determine diplotypes by maximum likelihood estimation for samples drawn from multiple populations.

[0016] The present invention has been made in view of this situation and solves the above problem. Its object is to provide a genome analysis system and analysis method that can estimate, from sample data, the characteristics of a population and/or the position of each sample in the population.

Means for solving the problem

The present invention has the following configurations to solve the above problems.

The gist of the invention of claim 1 resides in a genome analysis system characterized in that sample data is taken in; two state variables, a first state variable and a second state variable, are selected, which are state variables that characterize the population to which the sample data belongs and/or state variables that represent the position of each sample of the sample data in the population; and the system has convergence means for converging the first state variable and the second state variable to their proper values, and feature estimation means by which the characteristics of the population and/or the position of each sample in the population are estimated.

The gist of the invention described in claim 2 resides in a genome analysis system comprising capture means for taking in sample data and calculation means, characterized in that the calculation means:

selects a first state variable and a second state variable, which are state variables that characterize the population to which the sample data taken in by the capture means belongs and/or state variables that represent the position of each sample of the sample data in the population;

converges the first state variable and the second state variable to their proper values; and

estimates the characteristics of the population and/or the position of each specimen in the population.

The gist of the invention described in claim 3 resides in the genome analysis system of claim 1 or 2, characterized by further comprising: conversion means for converting the first state variable and the second state variable into each other, using as operators update expressions in which knowledge of genetics (statistics) is embedded and in which each variable is expressed in terms of the other; and estimation means for estimating the first state variable and the second state variable by means of a third state variable embedded in the update expressions adapted to the first state variable and the second state variable, respectively.

The gist of the invention described in claim 4 resides in the genome analysis system according to any one of claims 1 to 3, wherein the first state variable is the origin population membership of each sample of the sample data, and the second state variable is the origin population haplotype frequency of the sample data.

The gist of the invention described in claim 5 resides in the genome analysis system according to any one of claims 1 to 4, wherein the third state variable is the diplotype of each sample of the sample data and its frequency.

The gist of the invention described in claim 6 is that the first state variable update expression, which is the update expression adapted to the first state variable, is represented by the following expression (1), in the genome analysis system according to any of the above.

[Equation 1]

The gist of the invention described in claim 7 is that the second state variable update expression, which is the update expression adapted to the second state variable, is represented by the following expression (2), in the genome analysis system according to any of the above.

[Equation 2]

The gist of the invention described in claim 8 is that the second state variable update expression, which is the update expression adapted to the second state variable, is represented by the following expression (3), in the genome analysis system according to any of the above.

[Equation 3]

The gist of the invention described in claim 9 resides in the genome analysis system according to any one of claims 1 to 8, further comprising optimum-K deriving means for obtaining the optimum value of the number K of origin populations by the following equation (4).

[Equation 4]

$$\hat{K} = \arg\max_{K} \cdots \qquad (4)$$

The gist of the invention described in claim 10 resides in the genome analysis system described above, characterized by further comprising optimum-K deriving means for obtaining the optimum value of the number K of origin populations by the following equation (5).

[Equation 5]

The gist of the invention described in claim 11 is that the update equation for updating the first state variable and the second state variable is expressed by the following equation (6), in the genome analysis system according to any of the above.

[Equation 6]

The gist of the invention described in claim 12 resides in the genome analysis system according to any one of claims 1 to 11, comprising:

determining means for determining the genetic polymorphism to be investigated;

wet process means for determining or estimating the haplotype of each individual from allele information determined by a wet process for the genetic polymorphism of the population to be investigated;

feature parameter determining means for determining two feature parameters, each being a feature parameter that characterizes the group and/or a feature parameter that indicates position within the population;

update formula construction means for constructing, from genetic information, an update formula between the two feature parameters;

feature parameter deriving means for sequentially obtaining the two feature parameters by the update formula, starting from predetermined initial values; and

conversion convergence means for repeating the conversion until the two feature parameters converge,

wherein, by obtaining the two feature parameters, the characteristics of the population and/or the position of each sample in the population are estimated.

The gist of the invention described in claim 13 resides in a genome analysis method comprising: an acquisition step of acquiring sample data; a step of selecting a first state variable and a second state variable, which are state variables characterizing the population to which the sample data belongs and/or state variables indicating the position of each specimen in the population; and a feature estimation step of estimating the characteristics of the population and/or the position of each sample in the population by converging the first state variable and the second state variable to their proper values.

The gist of the invention described in claim 14 resides in the genome analysis method of claim 13, characterized by further comprising: a conversion step of converting the first state variable and the second state variable into each other, using as operators update expressions in which genetic (statistical) knowledge is embedded and in which each variable is expressed in terms of the other; and an estimation step of estimating the first state variable and the second state variable by using a third state variable embedded in the update expressions.

The gist of the invention of claim 15 resides in the genome analysis method according to claim 13 or 14, wherein the first state variable is the origin population membership of each sample of the sample data, and the second state variable is the origin population haplotype frequency of the sample data.

The gist of the invention of claim 16 resides in the genome analysis method according to claim 14 or 15, wherein the third state variable is the diplotype of each sample of the sample data and its frequency.

The gist of the invention described in claim 17 is that the first state variable update expression, which is the update expression adapted to the first state variable, is expressed by the following expression (1), in the genome analysis method described in any of the above.

[Equation 1]

The gist of the invention described in claim 18 is that the second state variable update expression, which is the update expression adapted to the second state variable, is expressed by the following expression (2), in the genome analysis method described in any of the above.

[Equation 2]

The gist of the invention described in claim 19 is that the second state variable update expression, which is the update expression adapted to the second state variable, is expressed by the following expression (3), in the genome analysis method described in any of the above.

[Equation 3]

The gist of the invention described in claim 20 resides in the genome analysis method described above, further comprising an optimum-K derivation step of obtaining the optimum value of the number K of origin populations by the following equation (4).

[Equation 4]

$$\hat{K} = \arg\max_{K} \cdots \qquad (4)$$

The gist of the invention described in claim 21 resides in the genome analysis method according to any one of claims 13 to 20, comprising an optimum-K derivation step of obtaining the optimum value of the number K of origin populations by the following equation (5).

[Equation 5]

The gist of the invention described in claim 22 is that the update equation for updating the first state variable and the second state variable is expressed by the following equation (6), in the genome analysis method described in any of the above.

[Equation 6]

The gist of the invention described in claim 23 resides in the genome analysis method according to any one of claims 13 to 22, comprising:

a determination step of determining the genetic polymorphism to be investigated;

a wet process step of determining allele information by a wet process for the genetic polymorphism of the population to be investigated;

a haplotype estimation step of determining or estimating the haplotype of each individual from the allele information;

a feature parameter determining step of determining two feature parameters that characterize a group;

an update formula construction step of constructing, from genetic information, an update formula between the two feature parameters;

a feature parameter deriving step of sequentially obtaining the two feature parameters by the update formula, starting from predetermined initial values; and

a conversion convergence step of repeating the conversion until the two feature parameters converge,

wherein the two feature parameters are obtained, and the characteristics of the population and/or the position of each specimen in the population are estimated from the sample data.

The gist of the invention described in claim 24 resides in a program capable of executing the genome analysis method according to any one of claims 13 to 23.

The invention's effect

[0018] The genome analysis system of the present invention can determine state variables that represent the characteristics of the population and state variables that represent the position of each sample in the population, for example the origin population membership of each sample and the haplotype frequencies of each origin population, at a much higher speed than conventional methods, using genotype data and multitype data of multiple loci.

[0019] In addition, the origin populations can be estimated and each sample assigned to an origin population with higher accuracy than with conventional methods. Whereas the number of samples that can be determined at one time is about 20 in the conventional method, the method of the present invention yields results at once for a larger number of samples.

[0020] In addition, the origin populations can be estimated and each sample assigned to an origin population even for samples from multiple populations, which was difficult with conventional genome analysis methods.

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, embodiments of the present invention will be described.

Fig. 1 is a diagram for explaining the outline of the genome analysis system using the genome analysis method of the present invention, Fig. 2 is a block diagram of the genome analysis system of the present invention, Fig. 3 is a diagram showing the outline of the analysis by the genome analysis system of Fig. 1, and Fig. 4 is a flow chart showing the genome analysis method of the present invention.

[0022] As shown in FIG. 1, the genome analysis system 1 uses sample data to estimate the characteristics of each population and the position of each sample in the population, and outputs the analysis result. The sample data is sampled from a population of genomic information in the broad sense, represented by genetic polymorphisms. A notebook computer, desktop computer, or the like equipped with an analysis program that performs the genome analysis calculations described later can be used as the genome analysis system 1. As shown in FIG. 2, the genome analysis system of the present invention is configured from determination means, wet process means, capture means, calculation means, selection means, feature parameter determination means, convergence means, conversion means, update formula construction means, feature parameter deriving means, conversion convergence means, feature estimation means, and estimation means.

[0023] The analysis by the genome analysis system 1 is outlined in FIG. 3: the entity under study is modeled as something that can be characterized by two state variables that characterize a group. A state variable that characterizes a population is a statistic derived from the population or from each sample; examples include the origin population membership of each sample, the origin population haplotype frequency, and the diplotype frequency of each individual. The state variables comprise state A, the first state, and state B, the second state. Genetic (statistical) knowledge is embedded in the two update arithmetic expressions φ, one for each state. By repeating the update operations on state A and state B, they converge to the values (states) of the entity (the population or each sample), so that the characteristics of the population and the position of each sample in the population are estimated.

[0024] Here, state A is the origin population membership of each sample, and state B is the origin population haplotype frequency. State A and state B are converted into each other, each using as an operator the update expression written in terms of the other. Details of this update expression will be described later.

[0025] In addition, the genome analysis system 1 has a function of estimating three variables that represent the characteristics of the population to which the sample data belongs or the position of each specimen in the population. The first variable and the second variable are loosely related through the third variable, and the system estimates these three variables from a fourth variable that can be observed. For example, as shown in Fig. 3, the method focuses on the fact that state A and state B can be regarded as two aspects of one group. The feature parameters are none other than these three variables.

[0026] The first, second, third, and fourth variables are defined by the following expression (7). Here, the samples are indexed i = 1, 2, ..., I, the origin populations k = 1, 2, ..., K, and the haplotypes h = 1, 2, ..., H.

[0027] [Equation 7]

(I) Probability that sample i comes from origin population k (summing to 1 over k)

(II) Frequency of haplotype h in origin population k: $b_{kh}$, with $\sum_{h} b_{kh} = 1$

The vector may be written as $b_k = [b_{k1}, \ldots, b_{kH}]$.

(III) Probability that sample i has a given diplotype (summing to 1 over all diplotypes)

A diplotype is the set of the mother-derived and father-derived haplotypes $\{h_1, h_2\}$.

(IV) Genotype observed for sample i: $x_i$

... (7)

[0028] The first and second variables can be regarded as two states that characterize the system of interest; they are not completely independent but are loosely related via the third variable. With this in mind, the first and second variables in expression (7) can each be regarded as an update operator that updates the other.

[0029] Update operators adapted to the first variable and the second variable can then be derived from the observed fourth variable, with genetic (statistical) knowledge embedded in these update operators. If the first variable and the second variable are weakly related to each other in this way, then given appropriate initial values they converge to the characteristics of the population and the position of each sample in the population.

[0030] As a specific example, consider the case where the sampled population is composed of several origin populations and the origin populations are to be estimated from the sample data alone.

[0031] The genetic knowledge is expressed by the following equation (8). It rests on the assumption that, if it is known which origin population a particular sample came from and the haplotype frequencies of that origin population are known, the probability that the sample has a particular diplotype is simply that of drawing a haplotype twice, with replacement, from the origin population.

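[0032] [Equation 8]

$$P(d_i \mid k_i, \{b_k\}) = \prod_{j=1}^{2} b_{k_i d_{i,j}} \qquad (8)$$

where $d_{i,1}$ and $d_{i,2}$ are the two haplotypes that make up the diplotype $d_i$.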

[0033] Here it is assumed that, for any sample, every origin population is equally probable as its source, and this simple assumption is introduced as a prior distribution as in equation (9) below. The overall probability model is then expressed as equation (10) below.

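[0034] [Equation 9]

$$P(k_i \mid K) = \frac{1}{K} \qquad (9)$$

[Equation 10]

$$P(\{x_i\}, \{d_i\}, \{k_i\} \mid \{b_k\}, K) = \prod_{i=1}^{I} \frac{1}{K}\, \delta_{d_i \in D(x_i)} \prod_{j=1}^{2} b_{k_i d_{i,j}} \qquad (10)$$

$D(x_i)$ represents the set of all possible diplotypes for the observed genotype $x_i$. $\delta$ is a function that takes the value 1 if the expression in its subscript is true, and 0 otherwise.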

[0036] Since equation (10) includes random variables that cannot be observed, the optimal parameters are obtained within the framework of the EM algorithm. Specifically, the optimal parameters that characterize the observed data are estimated according to the following equation (11). That is, the optimal number of origin populations, the haplotype frequencies of each population, the probability of each diplotype for each sample, and the probability that each sample originated from each origin population are obtained.

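[0037] [Equation 11]

$$\{\{\hat{b}_k\}, \hat{K}\} \equiv \arg\max_{\{b_k\}, K} \left\langle \ln P(\{x_i\}, \{d_i\}, \{k_i\} \mid \{b_k\}, K) \right\rangle_{P(\{d_i\}, \{k_i\} \mid \{x_i\}, \{b_k\}, K)} \qquad (11)$$

where the symbol $\langle X \rangle_Y$ denotes the average of X with respect to Y.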

[0038] Assuming that the number of origin populations K is known, the haplotype frequencies of each origin population can be obtained using the following two sequential update equations (12).

[0039] [Equation 12]

... (12)

Set appropriate initial values for the two variables and repeat the updates for t = 0, 1, 2, ... until the values converge. The converged values are used as the estimates.
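Equation (12) is not legible in this text, so the following is only a minimal sketch of a standard EM iteration for the model described above (a uniform prior over K origin populations, and a diplotype probability given by the product of two haplotype frequencies). The function name `em_origin_populations`, the input layout `candidates[i]` (the list of haplotype-ID pairs in D(x_i)), the random initial values and the small count floor are all assumptions made for illustration; they are not the patent's own update equations.

```python
import random


def em_origin_populations(candidates, n_haplotypes, n_populations,
                          n_iter=500, seed=0):
    """Hypothetical sketch: alternate membership and haplotype-frequency updates.

    candidates[i] is D(x_i), a list of (h1, h2) haplotype-ID pairs for sample i.
    Returns (membership, b): membership[i][k] is the estimated probability that
    sample i comes from population k; b[k][h] is the frequency of haplotype h
    in population k.
    """
    rng = random.Random(seed)
    # Random (non-uniform) initial frequencies so the populations can separate.
    b = []
    for _ in range(n_populations):
        row = [rng.random() + 0.1 for _ in range(n_haplotypes)]
        total = sum(row)
        b.append([v / total for v in row])

    membership = []
    for _ in range(n_iter):
        # Small floor (an assumption) keeps every frequency strictly positive.
        counts = [[1e-12] * n_haplotypes for _ in range(n_populations)]
        membership = []
        for pairs in candidates:
            # Joint weight of (population k, diplotype d); the 1/K prior cancels.
            z = [[b[k][h1] * b[k][h2] for h1, h2 in pairs]
                 for k in range(n_populations)]
            norm = sum(sum(row) for row in z)
            # Membership of sample i: joint posterior summed over diplotypes.
            membership.append([sum(row) / norm for row in z])
            # Expected haplotype counts per population.
            for k in range(n_populations):
                for (h1, h2), w in zip(pairs, z[k]):
                    counts[k][h1] += w / norm
                    counts[k][h2] += w / norm
        # Haplotype frequencies: normalise the expected counts per population.
        b = [[c / sum(row) for c in row] for row in counts]
    return membership, b
```

For example, five samples that can only be homozygous for haplotype 0 and five that can only be homozygous for haplotype 1 are split into two populations when the sketch is run with `n_populations=2`. Which population receives which label depends on the random initial values.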

[0040] The optimum K can be estimated by the following equation (13). Since K is a natural number, the optimal K can be obtained by evaluating the expression for various values K = 1, 2, .... Also, a,

[0041] [Equation 13]

$$\hat{K} = \arg\max_{K} \cdots \qquad (13)$$

[0042] [Equation 14]

[0043] Next, a variant is described that introduces an approximation into part of the above estimation method and reduces the amount of memory needed for the calculation. The approximation introduced is the following equation (15). It amounts to assuming that the random variables d_i and k_i are independent. There is little theoretical basis for such an assumption, but empirically the estimation results are almost the same as when the assumption is not introduced.
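Equation (15) itself is not legible in this text. One plausible reading of the stated independence assumption, offered only as a hedged sketch, is that the joint posterior over diplotype and origin population factorizes. The conditioning on the observed genotype x_i is an assumption made here.

```latex
% Assumed form of the independence approximation (not taken from the source):
P(d_i, k_i \mid x_i) \approx P(d_i \mid x_i)\, P(k_i \mid x_i)

% Consequence for storage, with |D| the number of candidate diplotypes per sample:
% a joint table of about I \cdot K \cdot |D| entries is replaced by
% two tables of about I \cdot |D| + I \cdot K entries.
```

Under this reading, the memory saving comes from never storing the joint array over samples, populations and diplotypes.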

[0044] [Equation 15]

[0045] The estimation algorithm is then as follows. First, find the value of a. This can be obtained by assuming that the number of origin populations is one. Specifically, the following sequential update equation (16) is used.

[0046] [Equation 16]

... (16)

The converged value is used hereafter.

[0047] Next, equation (12) is replaced with the following equation (17). Because the huge array z is no longer needed, significant memory savings are possible.

[0048] [Equation 17]

... (17)

The converged values are used hereafter.

[0049] The estimation equation corresponding to equation (13) is replaced by the following equation (18).

[0050] [Equation 18]

[0051] Equation (14) becomes unnecessary because the estimation has already been completed in equations (16) and (17).

[0052] In addition, a step of obtaining an optimal solution for the number K of origin populations can be provided as follows. That is, the variable update equation can be expressed as equation (19), and the optimum K as equation (20).

[0053] [Equation 19]

[0054] [Equation 20]

$$\hat{K} = \arg\max_{K} \cdots \qquad (20)$$

[0055] The genome analysis method of the present embodiment described above is now explained again in more detail and in plainer terms, by way of example and in comparison with conventional analysis methods. For simplicity, the subjects sampled are referred to as "persons".

[0056] Conventionally, there has been a method of diplotype determination by maximum likelihood estimation, using an EM algorithm, for samples from a single population.

This method is applicable when the following two conditions are assumed.

(1) Complete linkage (the genes/bases under analysis belong to a single haplotype block)

(2) Sampling from a single population (the persons surveyed belong to a single population)

[0057] Under the above conditions, each haplotype is given an ID number h as shown in the following equation (21). There are H types of haplotypes.

[0058] [Equation 21]

$$h: \text{haplotype ID } (h = 1, \ldots, H) \qquad (21)$$

[0059] Next, the frequency of each haplotype in the population is denoted y, as in the following equation (22).

[0060] [Equation 22]

$$y = [y_1, \ldots, y_H]: \text{haplotype frequencies} \qquad (22)$$

[0061] Since y is the frequency of each haplotype in the population, the elements of y sum to 1, as shown in equation (23).

[0062] [Equation 23]

$$\sum_{h=1}^{H} y_h = 1 \qquad (23)$$

[0063] In addition, the diplotype of the i-th person is given a number, as shown in equation (24), and is denoted d_i.

[0064] [Equation 24]

$$d_i = [d_{i,1}, d_{i,2}]: \text{diplotype of the } i\text{-th subject (perfect information)} \qquad (24)$$

[0065] For the i-th person, an ID number is assigned to the observed genotype (for example, observation data of multiple SNPs) and is denoted x_i, as shown in equation (25). There are I persons in all.

[0066] [Equation 25]

$$x_i: \text{genotype ID of the } i\text{-th subject } (i = 1, 2, \ldots, I) \qquad (25)$$

[0067] For a given genotype x, the set of diplotypes that could give rise to it is denoted D(x), as shown in equation (26). In other words, when the genotype x_i of the i-th person is known, all diplotypes that this person could have are collected in the set D(x_i).

[0068] [Equation 26]

$$D(x): \text{set of possible diplotypes given } x \qquad (26)$$

[0069] In addition, δ is introduced, as shown in the following equation (27), as an indicator of whether d_i, the diplotype of the i-th person, is in D(x_i), the set of diplotypes that are possible given x_i, the genotype of the i-th person. This δ takes the value 1 if d_i is in D(x_i), and 0 otherwise.

[0070] [Equation 27]

$$\delta_{d_i \in D(x_i)}: \text{indicator function; } 1 \text{ if } d_i \in D(x_i), \text{ or } 0 \text{ if } d_i \notin D(x_i) \qquad (27)$$
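The set D(x) and the indicator δ can be illustrated with a short sketch. The encoding is an assumption made here, not something the text specifies: a genotype is a tuple giving, for each SNP, the number of copies (0, 1 or 2) of one allele, and a haplotype is a tuple of 0/1 alleles. The function names `possible_diplotypes` and `indicator` are hypothetical.

```python
from itertools import product


def possible_diplotypes(genotype):
    """Return D(x): all unordered haplotype pairs consistent with genotype x.

    genotype: tuple of 0, 1 or 2 (assumed encoding: allele copies per SNP).
    """
    per_site = []
    for g in genotype:
        if g == 0:
            per_site.append([(0, 0)])
        elif g == 2:
            per_site.append([(1, 1)])
        else:
            # Heterozygous site: the phase is unknown, so both are possible.
            per_site.append([(0, 1), (1, 0)])

    diplotypes = set()
    for combo in product(*per_site):
        h1 = tuple(first for first, _ in combo)
        h2 = tuple(second for _, second in combo)
        # Store each pair in sorted order so {h1, h2} is counted once.
        diplotypes.add(tuple(sorted((h1, h2))))
    return sorted(diplotypes)


def indicator(diplotype, genotype):
    """Return 1 if the diplotype is in D(x) for this genotype, otherwise 0."""
    return 1 if tuple(sorted(diplotype)) in possible_diplotypes(genotype) else 0
```

With this encoding, a genotype that is heterozygous at two SNPs, `(1, 1)`, has two possible diplotypes, whereas a genotype with no heterozygous site has exactly one.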

[0071] Under the above conditions, consider the probability of the diplotype of the i-th person. Since the two haplotypes are held independently of each other, the probability of the diplotype of the i-th person can be expressed by multiplying the haplotype frequencies, as shown in the following equation (28).

[0072] [Equation 28]

$$P(d_i \mid y) = \prod_{j=1}^{2} y_{d_{i,j}} \qquad (28)$$

[0073] If the diplotype of the i-th person is specified, the person's dienotype can be uniquely determined as shown in the following equation (29).

[0074] [Equation 29]

$$P(x_i \mid d_i) \equiv \delta_{d_i \in D(x_i)} \quad \cdots (29)$$

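[0075] From equations (28) and (29) above, the genotype probability for the i-th person can be expressed as equation (30) below.

[0076] [Equation 30]

$$P(x_i \mid y) = \sum_{d_i} P(x_i \mid d_i)\, P(d_i \mid y) = \sum_{d_i \in D(x_i)} \prod_j y_{d_{i,j}} \quad \cdots (30)$$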
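The genotype probability obtained by combining equations (28) and (29) can be sketched in Python as a sum over compatible diplotypes. This is an illustrative sketch only: the 0/1/2 genotype coding, the ordered-pair treatment of diplotypes, and the names `compatible_diplotypes` and `genotype_probability` are assumptions, not part of the specification.

```python
from itertools import product

def compatible_diplotypes(genotype):
    """D(x): ordered haplotype pairs whose allele counts add up to the genotype."""
    options = {0: [(0, 0)], 1: [(0, 1), (1, 0)], 2: [(1, 1)]}
    sites = [options[g] for g in genotype]
    return [
        (tuple(a for a, _ in combo), tuple(b for _, b in combo))
        for combo in product(*sites)
    ]

def genotype_probability(genotype, y):
    """Sum over d in D(x) of the product of the two haplotype frequencies.

    y maps a haplotype (tuple of 0/1) to its frequency; missing keys count as 0.
    """
    return sum(
        y.get(h1, 0.0) * y.get(h2, 0.0)
        for h1, h2 in compatible_diplotypes(genotype)
    )

# Hypothetical frequencies for three two-SNP haplotypes.
y = {(0, 0): 0.5, (1, 1): 0.3, (0, 1): 0.2}
p_double_het = genotype_probability((1, 1), y)
```

Because diplotypes are enumerated as ordered pairs here, the probabilities of all possible genotypes add up to 1 without an extra factor of 2 for heterozygotes.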

[0077] However, since this is difficult to calculate directly, the EM algorithm is introduced here. In equation (30), x_i is an observable variable, whereas d_i is a variable that cannot be observed. What we want is the best y under equation (30), which amounts to finding y by maximum likelihood estimation. Therefore, in the genome analysis method of the present embodiment, the EM algorithm is introduced to obtain the best (maximum likelihood) y, as shown in equation (31) below: the natural logarithm of the probability is taken, its average is considered, and the resulting expression is maximized with respect to y.

[0078] [Equation 31]

$$\hat{y} \equiv \arg\max_y \left\langle \ln P(\{x_i\}, \{d_i\} \mid y) \right\rangle_{P(\{d_i\} \mid \{x_i\}, y)} = \arg\max_y Q(\{x_i\} \mid y) \quad \cdots (31)$$

[0079] In the EM algorithm, the solution of equation (31) is approached by iterative substitution. This can be expressed mathematically as equation (32) below: starting from y^(t), y is recalculated repeatedly until it converges.

[0080] [Equation 32]

$$y^{(t+1)} = \arg\max_y \left\langle \ln P(\{x_i\}, \{d_i\} \mid y) \right\rangle_{P(\{d_i\} \mid \{x_i\}, y^{(t)})} \quad \cdots (32)$$

[0081] At this time, d_i, which cannot be observed, is handled by the EM algorithm. From the EM algorithm, a, the probability that the i-th person has the diplotype d_i, can be obtained as shown in equation (33) below. This means that each individual's diplotype can be determined.

[0082] [Equation 33]

$$\text{E-step: } a_{i,d_i}^{(t)} \equiv P(d_i \mid x_i, y^{(t)}) \quad \cdots (33)$$

[0083] Similarly, the update that makes y converge can be expressed by equation (34) below.

[0084] [Equation 34]

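$$\text{M-step: } y_h^{(t+1)} = \frac{1}{2I} \sum_{i=1}^{I} \sum_{d_i \in D(x_i)} a_{i,d_i}^{(t)} \sum_{j=1}^{2} \delta_{d_{i,j},\,h} \quad \cdots (34)$$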
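The E-step and M-step for a single population can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the specification: each subject's D(x_i) is supplied as a list of ordered haplotype pairs, the initial y is uniform, the E-step normalizes the products of equation (28) over D(x_i), and the M-step is taken to be the normalized expected haplotype count. The name `em_single_population` is hypothetical.

```python
def em_single_population(D, n_iter=100):
    """EM estimate of haplotype frequencies y and diplotype weights a.

    D[i] is the list of ordered haplotype pairs compatible with subject i.
    Returns (y, a), where a[i][n] is the weight of diplotype D[i][n].
    """
    haps = sorted({h for Di in D for d in Di for h in d})
    y = {h: 1.0 / len(haps) for h in haps}  # uniform initial value (assumption)
    a = []
    for _ in range(n_iter):
        # E-step: posterior weight of each compatible diplotype given current y.
        a = []
        for Di in D:
            w = [y[h1] * y[h2] for h1, h2 in Di]
            total = sum(w)
            a.append([v / total for v in w])
        # M-step: expected haplotype counts divided by the 2*I haplotypes observed.
        counts = {h: 0.0 for h in haps}
        for Di, ai in zip(D, a):
            for (h1, h2), p in zip(Di, ai):
                counts[h1] += p
                counts[h2] += p
        y = {h: v / (2 * len(D)) for h, v in counts.items()}
    return y, a

# Hypothetical data: eight homozygotes and one double heterozygote.
hom00 = [((0, 0), (0, 0))]
hom11 = [((1, 1), (1, 1))]
dhet = [((0, 0), (1, 1)), ((1, 1), (0, 0)), ((0, 1), (1, 0)), ((1, 0), (0, 1))]
y_hat, a_hat = em_single_population([hom00] * 4 + [hom11] * 4 + [dhet])
```

In this toy data set the homozygotes pull the ambiguous subject towards the (0,0)/(1,1) resolution, which is the sense in which the unobserved d_i is compensated for by the iteration.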

[0085] The method of estimation from a single population described above is conventional. In the present invention, this method is applied to provide a diplotype determination method for samples drawn from a plurality of populations. This diplotype determination method is explained in more detail below.

[0086] As with the estimation model for a single population described above, the following assumptions are made for the diplotype determination method of the present invention.

(2) Each individual subject to the survey belongs to one of the multiple populations.

[0087] Here, as with the estimation model for the single population described above, the parameters and variables are defined. First, K denotes the number of populations. The vector b_k gives, for the k-th population, the frequency of the haplotype with ID h. Therefore, summing b_{k,h} over all haplotypes h gives 1. Also, k_i is the ID number of the population to which the i-th person belongs. These parameters and variables are listed in equation (35).

[0088] [Equation 35]

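$$
\begin{aligned}
&K: \text{number of genetic populations} \\
&b_k \equiv [b_{k,1}, \ldots, b_{k,H}]: \text{haplotype frequencies in the } k\text{-th group}, \quad \sum_{h=1}^{H} b_{k,h} = 1 \\
&k_i: \text{population ID of the } i\text{-th subject}
\end{aligned}
\quad \cdots (35)
$$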

[0089] Here, a person is assumed to belong to each of the populations with equal probability. This is expressed by equation (36) below.

[0090] [Equation 36]

$$P(k_i) \equiv \frac{1}{K} \quad \cdots (36)$$

[0091] At this time, the probability of d_i, the diplotype of the i-th person, can be expressed using k_i and the haplotype frequencies of the k_i-th population. Equation (37) representing this is shown below (compare equation (28) of the single-population estimation model).

[0092] [Equation 37]

$$P(d_i \mid k_i, \{b_k\}) \equiv \prod_j b_{k_i,\,d_{i,j}} \quad \cdots (37)$$

[0093] Here, equation (30) of the single-population estimation model is extended, and the probability of the i-th person's genotype can be expressed as equation (38) below.

[0094] [Equation 38]

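$$P(x_i \mid \{b_k\}) = \sum_{k_i} \sum_{d_i} P(x_i \mid d_i)\, P(d_i \mid k_i, \{b_k\})\, P(k_i) \quad \cdots (38)$$

Here x_i is the observed variable, and d_i and k_i are unobserved.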
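The extension of the genotype probability to several populations can be illustrated with a short Python sketch. It assumes the equal-probability population prior stated in the text (1/K per population), takes D(x_i) as a precomputed list of ordered haplotype pairs, and uses the hypothetical name `mixture_genotype_probability`; none of these names come from the specification.

```python
def mixture_genotype_probability(diplotypes, b):
    """Genotype probability under K populations with a uniform prior 1/K.

    diplotypes: D(x_i) as a list of ordered haplotype pairs (h1, h2).
    b: list of K dicts, b[k][h] = frequency of haplotype h in population k.
    """
    K = len(b)
    total = 0.0
    for b_k in b:
        for h1, h2 in diplotypes:
            # Product of the two haplotype frequencies within population k.
            total += b_k.get(h1, 0.0) * b_k.get(h2, 0.0)
    return total / K

# Hypothetical example: a heterozygote A/B and two populations.
D_het = [("A", "B"), ("B", "A")]
b_example = [{"A": 1.0}, {"A": 0.5, "B": 0.5}]
p_het = mixture_genotype_probability(D_het, b_example)
```

With K = 1 the expression reduces to the single-population sum over D(x_i).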

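[0095] At this time, we want to obtain the best haplotype frequencies b_k and the best number of populations K by maximum likelihood estimation. This is shown in equation (39) below.

[0096] [Equation 39]

$$(\{\hat{b}_k\}, \hat{K}) \equiv \arg\max_{\{b_k\},\,K} \sum_{i=1}^{I} \ln P(x_i \mid \{b_k\}) \quad \cdots (39)$$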

[0097] At this time, d_i and k_i, which cannot be observed, are handled by the EM algorithm. From the EM algorithm, z, the probability that the i-th person has the diplotype d_i and belongs to the k_i-th population, can be obtained as shown in equation (40) below. This means that each individual's diplotype, the population each individual belongs to, and the number of populations can be determined. The memory required for this is of order O(IDK).

[0098] [Equation 40]

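$$\text{E-step: } z_{i,d_i,k_i}^{(t)} \equiv P(d_i, k_i \mid x_i, \{b_k^{(t)}\}) = \frac{\delta_{d_i \in D(x_i)} \prod_j b_{k_i,\,d_{i,j}}^{(t)}}{\sum_{k'} \sum_{d' \in D(x_i)} \prod_j b_{k',\,d'_j}^{(t)}} \quad \cdots (40)$$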

[0099] Similarly, the convergence of bk can be expressed by the following equation (41).

[0100] [Equation 41]

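$$\text{M-step: } b_{k,h}^{(t+1)} = \frac{\sum_i \sum_{d_i \in D(x_i)} z_{i,d_i,k}^{(t)} \sum_{j=1}^{2} \delta_{d_{i,j},\,h}}{\sum_{h'} \sum_i \sum_{d_i \in D(x_i)} z_{i,d_i,k}^{(t)} \sum_{j=1}^{2} \delta_{d_{i,j},\,h'}} \quad \cdots (41)$$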
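The joint E-step and M-step over diplotypes and populations can be sketched as below. This is an illustrative sketch under assumptions, not the code of the specification: D(x_i) is passed in as lists of ordered haplotype pairs, the initial b_k are random, the uniform prior 1/K cancels in the E-step normalization, and the M-step normalizes expected haplotype counts within each population. The name `em_mixture` is hypothetical.

```python
import random

def em_mixture(D, K, n_iter=100, seed=0):
    """EM over joint weights z[i][n][k] for diplotype D[i][n] and population k."""
    rng = random.Random(seed)
    haps = sorted({h for Di in D for d in Di for h in d})
    b = []
    for _ in range(K):
        w = [rng.random() + 0.1 for _ in haps]  # random positive start (assumption)
        b.append({h: v / sum(w) for h, v in zip(haps, w)})
    z = []
    for _ in range(n_iter):
        # E-step: z is proportional to the product of the two frequencies in
        # population k, normalized over all diplotypes and populations.
        z = []
        for Di in D:
            w = [[b[k][h1] * b[k][h2] for k in range(K)] for h1, h2 in Di]
            total = sum(sum(row) for row in w)
            z.append([[v / total for v in row] for row in w])
        # M-step: expected haplotype counts per population, normalized over h.
        counts = [{h: 0.0 for h in haps} for _ in range(K)]
        for Di, zi in zip(D, z):
            for (h1, h2), row in zip(Di, zi):
                for k, p in enumerate(row):
                    counts[k][h1] += p
                    counts[k][h2] += p
        b = [{h: v / sum(ck.values()) for h, v in ck.items()} for ck in counts]
    return b, z

# Hypothetical data: three A/A, three B/B and one A/B subject.
data = [[("A", "A")]] * 3 + [[("B", "B")]] * 3 + [[("A", "B"), ("B", "A")]]
b_hat, z_hat = em_mixture(data, K=2, n_iter=30)
```

The nested list `z` holds one value per subject, diplotype and population, which is the O(IDK) memory footprint noted in the text.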

[0101] However, since the calculation as it stands requires a large amount of memory, the inventors introduced an approximation and found that the order of the memory required for the calculation can be reduced from O(IDK) to O(ID). This approximation is described in detail below. In addition, the number of heterotypes can be limited to approximately 30. In general, the number of iterations the EM algorithm needs to converge varies greatly with the initial value, so the inventors ran multiple trials when implementing the genome analysis method of the present invention. In addition, as an empirical rule, reasonable results are more easily obtained if the number of origin populations K is limited to the range K = 1 to √I. This can be used when setting the initial value.

[0102] The approximation found by the inventors, which reduces the amount of memory required for the calculation, is now described in detail. When the genotype of the i-th person and the haplotype frequencies of the populations are known, the probability that the i-th person has the diplotype d_i and belongs to the k_i-th population is rewritten as the product of the probability that the i-th person has the diplotype d_i and the probability that the i-th person belongs to the k_i-th population. Equation (42) representing this is shown below.

[0103] [Equation 42]

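$$P(d_i, k_i \mid x_i, \{b_k\}) \approx P(d_i \mid x_i)\, P(k_i \mid x_i) \equiv a_{i,d_i}\, c_{i,k_i} \quad \cdots (42)$$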

[0104] Since a in equation (42) is the same as a in the single-population maximum likelihood model described above, it can be reused, and with the EM algorithm c, the probability that the i-th person comes from a given origin population, can be expressed as equation (43) below. The memory required for this is of order O(IK).

[0105] [Equation 43]

$$\text{E-step: } c_{i,k}^{(t)} \equiv P(k_i = k \mid x_i, \{b_k^{(t)}\}) \quad \cdots (43)$$

[0106] Similarly, the update that makes b_k converge can be expressed as equation (44) below. The memory required for this is of order O(KH).

[0107] [Equation 44]

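$$\text{M-step: } b_{k,h}^{(t+1)} = \frac{\sum_i c_{i,k}^{(t)} \sum_{d_i \in D(x_i)} a_{i,d_i} \sum_{j=1}^{2} \delta_{d_{i,j},\,h}}{2 \sum_i c_{i,k}^{(t)}} \quad \cdots (44)$$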
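The factorized iteration, in which the diplotype weights a are held fixed and only the memberships c and the population haplotype frequencies b are alternated, might be sketched as follows. Several points are assumptions made for illustration only: the explicit right-hand side of the membership update is not legible in this text, so the sketch assumes c_{i,k} proportional to the sum over D(x_i) of a times the product of the two haplotype frequencies in population k; the b update is the membership-weighted expected haplotype count; and the names `approx_em`, `D`, `a` are hypothetical.

```python
import random

def approx_em(D, a, K, n_iter=100, seed=0):
    """Alternate b (haplotype frequencies per population) and c (memberships).

    D[i]: list of ordered haplotype pairs for subject i.
    a[i][n]: fixed weight of diplotype D[i][n], summing to 1 for each subject.
    """
    rng = random.Random(seed)
    I = len(D)
    haps = sorted({h for Di in D for d in Di for h in d})
    c = []
    for _ in range(I):
        w = [rng.random() + 0.1 for _ in range(K)]  # random positive start
        c.append([v / sum(w) for v in w])
    b = []
    for _ in range(n_iter):
        # Update b: membership-weighted expected haplotype counts.
        b = []
        for k in range(K):
            counts = {h: 0.0 for h in haps}
            for i in range(I):
                for (h1, h2), p in zip(D[i], a[i]):
                    counts[h1] += c[i][k] * p
                    counts[h2] += c[i][k] * p
            norm = 2.0 * sum(c[i][k] for i in range(I))
            b.append({h: v / norm for h, v in counts.items()})
        # Update c (assumed form): likelihood of subject i under population k,
        # averaged over the fixed diplotype weights, normalized over k.
        c = []
        for i in range(I):
            w = [
                sum(p * b[k][h1] * b[k][h2] for (h1, h2), p in zip(D[i], a[i]))
                for k in range(K)
            ]
            c.append([v / sum(w) for v in w])
    return b, c

# Hypothetical inputs: two homozygotes and one heterozygote with equal weights.
D_ex = [[("A", "A")], [("B", "B")], [("A", "B"), ("B", "A")]]
a_ex = [[1.0], [1.0], [0.5, 0.5]]
b1, c1 = approx_em(D_ex, a_ex, K=1)
b2, c2 = approx_em(D_ex, a_ex, K=2, n_iter=50)
```

Only `a` (size I by D), `c` (size I by K) and `b` (size K by H) are stored, so no array indexed jointly by subject, diplotype and population is needed.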

[0108] As a result, the inventors have found that the calculation by iterative substitution shown in equation (44) converges after approximately 50 to 100 iterations, and that each individual's diplotype, the population each individual belongs to, and the number of origin populations can thereby be determined.

[0109] Next, the genome analysis steps performed by the genome analysis system 1 will be described.

First, as shown in Fig. 4, the genetic polymorphism to be investigated is determined (step S1: determination step). Next, allele information is determined by a wet process on the genetic polymorphism of the population to be investigated (step S2: wet process step). The wet process is a process of determining genomic information, such as the genetic polymorphisms of a sample, using a DNA sequencer or the like. Then, the haplotypes of individuals are determined or estimated from the allele information (step S3: haplotype estimation step).

[0110] Next, two loosely related feature parameters representing the group are determined (step S4: feature parameter determination step). Here, the origin population membership of each sample and the haplotype frequencies of each origin population are used as the two feature parameters. An update operator between the two feature parameters is then constructed from the genetic information and a third parameter (step S5: update formula construction step). The third parameter here is the individual diplotype and its frequency. Embedding genetic (statistical) knowledge, that is, genetic information, in the update operator means nothing other than adopting the individual diplotype and its frequency, which constitute genetic (statistical) knowledge, as the third parameter.

[0111] Then, starting from an appropriate initial value, the two feature parameters are obtained in turn by the update operator (step S6: feature parameter derivation step). The conversion is repeated until the parameters converge (step S7: conversion convergence step). The two feature parameters are thereby obtained (step S8). Updating the feature parameters using the update formula means nothing other than obtaining the two feature parameters in turn with this update operator, alternately deriving one from the other. Converging the parameters by this update means bringing the state variables to the values they should take, that is, approximating the true values.
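The alternating derivation of steps S6 to S8 can be sketched generically in Python. This is a toy illustration with scalar parameters, not the update formulas of the specification: the function name `alternate_until_converged`, the tolerance `tol`, and the two example update rules are all hypothetical.

```python
def alternate_until_converged(update_first, update_second, second_init,
                              tol=1e-9, max_iter=1000):
    """Derive two parameters from each other in turn until both stop changing.

    update_first(second) returns the first parameter given the second;
    update_second(first) returns the second parameter given the first.
    Returns (first, second, number_of_iterations_used).
    """
    second = second_init
    first = update_first(second)
    for step in range(1, max_iter + 1):
        new_second = update_second(first)      # derive one from the other
        new_first = update_first(new_second)   # and back again
        change = max(abs(new_first - first), abs(new_second - second))
        first, second = new_first, new_second
        if change < tol:                       # convergence check (step S7)
            return first, second, step
    return first, second, max_iter

# Toy update rules whose joint fixed point is first = second = 2.
first, second, steps = alternate_until_converged(
    lambda v: 0.5 * v + 1.0,
    lambda u: 0.5 * u + 1.0,
    second_init=0.0,
)
```

In the method described here the two parameters would be the origin population membership and the origin population haplotype frequencies, and the change would be measured over all of their elements rather than over two scalars.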

Example 1

[0112] Next, examples will be described.

Figures 5 to 9 show examples of the analysis results obtained by genome analysis with an update operator that uses multilocus genotype data and haplotype data to infer the origin populations and assign each sample to an origin population.

[0113] In genetic analysis, case-control correlation analysis is a powerful method for mapping genotype data to phenotype data (e.g., correlation mapping to find disease genes). However, in case-control correlation analysis, genotype data from structured populations can cause mapping errors and lead to false-positive results.

[0114] Therefore, it is desirable to detect latent population structure before case-control correlation analysis. Existing approaches for detecting latent population structure include MCMC methods based on Bayesian statistics, cluster models based on the concept of distance between samples, and methods that identify structured groups using locus alleles. In this embodiment, however, a new modeling method using a high-speed grouping algorithm was adopted.

[0115] The fast grouping algorithm is the analysis method of the present invention. Here, haplotypes are considered to carry more informative genetic information than alleles, so haplotypes are used instead of alleles as the genetic information for the analysis.

[0116] When two variables representing the characteristics of the population to which the sample data belongs are loosely related via a third variable, a method was adopted in which these three variables are estimated using a fourth variable that can be observed.

[0117] In this example, as described above, the haplotype frequencies b_k of the origin populations and the degree of membership c_{i,k} of each sample in the origin populations are adopted as the two state variables characterizing the population. As a result, the characteristics of the population to which the sampled individuals belong can be estimated. In the present embodiment, as described above, the third variable linking the two state variables is the individual diplotype d_i and its frequency a_{i,d_i}, and the genotype information x_i is adopted as the observed fourth variable.

[0118] First, the diplotype frequencies a_{i,d_i} are obtained from the observation data x. Specifically, an appropriate initial value is set for y, equations (45) and (46) are calculated in order, and the calculation is repeated about 100 times until the values converge.

[0119] [Equation 45]

Hereafter, let the converged value of a_{i,d_i}^{(t)} be a_{i,d_i}.

[0120] Next, assume that the number of origin populations K is 1.

[0121] Next, an appropriate initial value is set for c, and (47) and (48) are repeatedly calculated about 100 times until the values converge. As a result, the first and second state variables are obtained via the third variable.

[0122] [Equation 46]

$$b_{k,h}^{(t+1)} = \cdots \quad (47)$$

$$c_{i,k}^{(t+1)} = \cdots \quad (48)$$

Hereafter, let the converged values of c_{i,k}^{(t)} and b_{k,h}^{(t+1)} be c_{i,k} and b_{k,h}, respectively.

[0123] In addition, equations (49) to (52) can be used instead of equations (47) and (48).

[0124] [Equation 47]

$$\cdots \quad (49) \text{ to } (52)$$

[0125] Using the values obtained so far, the following equation (53) is calculated and recorded.

[0126] [Equation 48]

$$\sum_{i} \sum_{k} c_{i,k} \sum_{d_i \in D(x_i)} a_{i,d_i} \sum_{j} \ln b_{k,\,d_{i,j}} - I \ln K \quad \cdots (53)$$

[0127] In addition, equation (54) can be used instead of equation (53) (see [Equation 49] below).

[0128] [Equation 49]

$$\cdots \quad (54)$$

[0129] Similarly, the number of origin populations is set to 2, 3, 4, ..., and the calculations of (47) to (54) are repeated. This is repeated up to the largest natural number not exceeding √I.

[0130] Finally, the number of origin populations for which equation (53) or (54) takes the largest value is adopted as the optimal one. The values of the variables at that point are adopted as the optimum values.
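The selection of the number of origin populations can be sketched as a loop over candidate values of K. This sketch rests on assumptions: the score is taken to be the membership- and diplotype-weighted sum of log haplotype frequencies minus I ln K, which is one reading of the partly legible equation (53); candidates run from 1 to the largest integer not exceeding √I; and `score`, `select_K` and the caller-supplied `fit` callable are hypothetical names.

```python
import math

def score(D, a, c, b):
    """Weighted log-frequency score minus I * ln K (assumed reading of eq. (53)).

    D[i]: diplotypes of subject i, a[i][n]: their weights,
    c[i][k]: memberships, b[k][h]: haplotype frequencies of population k.
    """
    I, K = len(D), len(b)
    total = 0.0
    for i in range(I):
        for k in range(K):
            for (h1, h2), p in zip(D[i], a[i]):
                w = c[i][k] * p
                if w > 0.0:  # skip zero-weight terms (0 * ln 0 treated as 0)
                    total += w * (math.log(b[k][h1]) + math.log(b[k][h2]))
    return total - I * math.log(K)

def select_K(D, a, fit):
    """Try K = 1 .. floor(sqrt(I)); fit(K) must return converged (b, c)."""
    best = None
    for K in range(1, math.isqrt(len(D)) + 1):
        b, c = fit(K)
        s = score(D, a, c, b)
        if best is None or s > best[0]:
            best = (s, K, b, c)
    return best

# Hypothetical converged results for four subjects (two A/A, two B/B).
D4 = [[("A", "A")]] * 2 + [[("B", "B")]] * 2
a4 = [[1.0]] * 4
fits = {
    1: ([{"A": 0.5, "B": 0.5}], [[1.0]] * 4),
    2: ([{"A": 1.0, "B": 0.0}, {"A": 0.0, "B": 1.0}],
        [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]),
}
best_score, best_K, best_b, best_c = select_K(D4, a4, lambda K: fits[K])
```

In this toy case the two-population fit explains every subject exactly, so its score exceeds the single-population score despite the I ln K term.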

[0131] Next, the structure analysis data will be described.

[0132] FIG. 5 shows the difference in execution time between the structure analysis program of the present embodiment and the MCMC method. As shown in FIG. 5, the method of the present invention can output the result at a much higher speed than the conventional method.

[0133] Fig. 6 shows the haplotype frequency results of the two origin populations estimated by this example.

[0134] Fig. 7 shows the degree of membership c_{i,k} of each sample in the origin populations, as estimated by this example.

[0135] Fig. 8 compares the estimation accuracy of this example, the MCMC method, and the cluster method on various data. The method of the present invention estimates with higher accuracy than the conventional methods.

[0136] FIG. 9 is an example of the result of the estimated number of origin populations in this example.

Industrial applicability

[0137] As described above, according to the present invention, analysis for estimating the characteristics of a population from sample data can be performed at a higher speed and with respect to more samples.

Brief Description of Drawings

[0138] [Fig. 1] An explanatory diagram of the outline of a genome analysis system used in the genome analysis method of the present invention.

[Fig. 2] A block diagram of the genome analysis system of the present invention.

[Fig. 3] A diagram for explaining the outline of analysis by the genome analysis system of Fig. 1.

[Fig. 4] A flowchart showing the genome analysis method of the present invention.

[Fig. 5] A comparison of the execution time of the genome analysis method of the present invention and the MCMC method.

[Fig. 6] The results of the haplotype frequencies of the origin populations.

[Fig. 7] The degree of membership of each sample in the origin populations.

[Fig. 8] A comparison of origin population estimation results between the present invention, the MCMC method, and the cluster method.

[Fig. 9] The estimated number of origin populations.

Explanation of symbols

[0139] 1 Genome analysis system

Claims

The scope of the claims

[1] A genome analysis system comprising: importing means for importing sample data;

convergence means for selecting a first state variable and a second state variable, each being a state variable that characterizes the population to which the sample data belongs or a state variable that represents the position of each sample of the sample data in the population, and for converging the first state variable and the second state variable to desired values; and feature estimation means for estimating the characteristics of the population and/or the position of each sample in the population.

[2] A genome analysis system comprising taking-in means for taking in sample data and computing means,

characterized by estimating the characteristics of the population and/or the position of each sample in the population.

[3] The genome analysis system according to claim 1, further comprising: conversion means for converting the first state variable and the second state variable into each other using an update expression in which genetic (statistical) knowledge is embedded and in which each variable is expressed in terms of the other; and estimation means for estimating the first state variable and the second state variable by way of a third state variable embedded in the update expression adapted to each of the first state variable and the second state variable.

[4] The genome analysis system according to any one of claims 1 to 3, wherein the first state variable is the origin population membership degree of each sample of the sample data, and the second state variable is the origin population haplotype frequency of the sample data.

[5] The genome analysis system according to any one of claims 1 to 4, wherein the third state variable is the diplotype of each sample of the sample data and its frequency.

[6] The genome analysis system according to any one of claims 1 to 5, wherein the first state variable update formula, which is the update formula adapted to the first state variable, is represented by the following formula (1).

[Equation 1]

[7] The genome analysis system according to any one of claims 1 to 6, wherein the second state variable update formula, which is the update formula adapted to the second state variable, is represented by the following formula (2).

[Equation 2]

[8] The genome analysis system according to any one of claims 1 to 7, wherein the second state variable update formula, which is the update formula adapted to the second state variable, is represented by the following formula (3).

[Equation 3]

[9] The genome analysis system according to any one of claims 1 to 8, further comprising K optimum solution deriving means for obtaining the optimum number K of origin populations by the following equation (4).

[Equation 4]

$$K = \arg\max_K \left[ \sum_{i} \sum_{k} c_{i,k} \sum_{d_i \in D(x_i)} a_{i,d_i} \sum_{j} \ln b_{k,\,d_{i,j}} - I \ln K \right] \quad \cdots (4)$$

[10] The genome analysis system according to any one of claims 1 to 9, further comprising K optimum solution deriving means for obtaining the optimum number K of origin populations by the following equation (5).

[Equation 5]

$$K = \arg\max_K \cdots \quad (5)$$

[11] The genome analysis system according to any one of claims 1 to 10, wherein the update equation for updating the first state variable and the second state variable is represented by the following equation (6).

[Equation 6]

[12] The genome analysis system according to any one of claims 1 to 11, comprising: means for determining the genetic polymorphism to be investigated;

wet process means for determining or estimating an individual's haplotype from the allele information determined by the wet process on the genetic polymorphism of the population to be investigated; and feature parameter determining means for determining two feature parameters, which are feature parameters that characterize the population and/or feature parameters that represent position within the population,

wherein the characteristics of the population and/or the position of each sample in the population are estimated.

[13] A genome analysis method comprising: an importing step of importing sample data;

a convergence step of selecting a first state variable and a second state variable, each being a state variable that characterizes the population to which the sample data belongs and/or a state variable that represents the position of each sample in the population, and converging the first state variable and the second state variable to desired values; and a feature estimation step of estimating the characteristics of the population and/or the position of each sample in the population.

[14] The genome analysis method according to claim 13, further comprising: a conversion step of converting between the first state variable and the second state variable by using, as an operator, an update expression in which genetic (statistical) knowledge is embedded and in which each variable is expressed in terms of the other; and an estimation step of estimating the first state variable and the second state variable by way of a third state variable embedded in the update expression adapted to each of the first state variable and the second state variable.

[15] The genome analysis method according to claim 13 or 14, wherein the first state variable is the origin population membership degree of each sample of the sample data, and the second state variable is the origin population haplotype frequency of the sample data.

[16] The genome analysis method according to claim 14 or 15, wherein the third state variable is the diplotype of each sample of the sample data and its frequency.

[17] The genome analysis method according to any one of claims 13 to 16, wherein the first state variable update formula, which is the update formula adapted to the first state variable, is represented by the following formula (1).

[Equation 1]

[18] The genome analysis method according to any one of claims 13 to 17, wherein the second state variable update formula, which is the update formula adapted to the second state variable, is represented by the following formula (2).

[Equation 2]

[19] The genome analysis method according to claim 13, wherein the second state variable update formula, which is the update formula adapted to the second state variable, is represented by the following formula (3).

[Equation 3]

[20] The genome analysis method according to any one of claims 13 to 19, further comprising a K optimum solution deriving step of obtaining the optimum number K of origin populations by the following equation (4).

[Equation 4]

$$K = \arg\max_K \left[ \sum_{i} \sum_{k} c_{i,k} \sum_{d_i \in D(x_i)} a_{i,d_i} \sum_{j} \ln b_{k,\,d_{i,j}} - I \ln K \right] \quad \cdots (4)$$

[21] The genome analysis method according to any one of claims 13 to 20, further comprising a K optimum solution deriving step of obtaining the optimum number K of origin populations by the following equation (5).

[Equation 5]

$$K = \arg\max_K \cdots \quad (5)$$

[22] The genome analysis method according to any one of claims 13 to 21, wherein the update equation for updating the first state variable and the second state variable is represented by the following equation (6).

[Equation 6]

[23] The genome analysis method according to any one of claims 13 to 22, comprising: a determination step of determining the genetic polymorphism to be investigated; a wet process step of determining allele information by a wet process on the genetic polymorphism of the population to be investigated;

a feature parameter determining step of determining two feature parameters characterizing a group; an update formula construction step of constructing an update formula between the two feature parameters from genetic information;

a feature parameter deriving step of obtaining the two feature parameters in turn by the update formula, starting from a predetermined initial value; and a conversion convergence step of repeating the conversion until the two feature parameters converge, wherein the two feature parameters are thereby obtained, and the characteristics of the population and/or the position of each sample in the population are estimated from the sample data.

[24] A program capable of executing the genome analysis method according to any one of claims 13 to 23.