CN109063417A

CN109063417A - A genotype imputation method constructing a hidden Markov chain

Info

Publication number: CN109063417A
Application number: CN201810741480.5A
Authority: CN
Inventors: 倪晟宇; 包桉银
Original assignee: Fujian Guomai Biotechnology Co Ltd
Current assignee: Fujian Guomai Biotechnology Co Ltd
Priority date: 2018-07-09
Filing date: 2018-07-09
Publication date: 2018-12-21
Anticipated expiration: 2038-07-09
Also published as: CN109063417B

Abstract

The present invention relates to a genotype imputation method that constructs a hidden Markov chain. The genotype database is preprocessed using the minor alleles that the sample carries at rare sites; a localized haplotype-cluster model is created; nodes of the model whose downstream states are similar are merged; the paths of the model are pruned using the forward-backward algorithm; the haplotype probabilities obtained by Monte Carlo simulation are combined under Hardy-Weinberg equilibrium to give the prior probability of each diplotype of the sample, which is combined with the sequencing likelihoods to give the diplotype posterior probability; the diplotype posterior probabilities, screened against the micro-haplotypes of the sample, are used to rebuild the localized haplotype-cluster model in the next iteration; after N cycles, the genotype probability of each site is obtained from the diplotype posterior probabilities. The invention constructs localized haplotype clusters, generates diplotype posterior probabilities from a hidden Markov model, eliminates haplotypes that clearly contradict the micro-haplotypes of the sample, and through repeated iteration obtains highly accurate genotype calls.

Description

A genotype imputation method constructing a hidden Markov chain

Technical field

The present invention relates to a genotype imputation method that constructs a hidden Markov chain.

Background art

A gene is the basic unit of heredity. A DNA (deoxyribonucleic acid) or RNA (ribonucleic acid) sequence carrying genetic information passes that information to the next generation by replication and directs protein synthesis to express the information it carries, thereby controlling the traits of the individual organism. Genetic testing is a technique that examines DNA from blood, other body fluids, or cells: peripheral venous blood or other tissue cells are taken from the person tested, the genetic material is amplified, and the DNA in the cells is examined with dedicated equipment to determine which gene variants and gene defects are present and whether their expression and function are normal.

In sequenced gene sequences, some sites still go undetected because sequencing depth is insufficient or site coverage is low. Genotype imputation was introduced to infer these missing sites. Genotype imputation predicts the genotypes of sites with missing data, or of untyped sites, from the genotypes of sites that have been typed. It reduces the cost of direct genotyping, and it fills in the large numbers of genotypes lost when data from different genotyping platforms are merged, which helps the joint analysis of such data.

Certain alleles at different HLA (human leukocyte antigen) loci are often inherited together in linkage, and linked genes do not combine into haplotypes completely at random: some genes occur together more often than expected, so that certain haplotypes appear at higher frequency in the population, giving rise to linkage disequilibrium (LD). LD decays with the distance between sites. If this locality is not taken into account when genotype data are phased over an extended region, sampling variation introduces error and produces apparent correlation between distant markers, which reduces the accuracy of haplotype inference. Based on linkage disequilibrium between sites, current genotype imputation falls into two broad classes: one is used in GWAS (genome-wide association studies) of the general population and imputes using information from unrelated individuals; the other is mainly used in family-based GWAS. The principles of the two are broadly similar, and because of limits on sample collection, most imputation methods belong to the first class.

The genotype imputation methods currently available mainly use the complete genotype information provided by a reference population to construct haplotypes of mutually linked markers, and then use those haplotypes to fill in the missing segments of the target population's genotypes. According to whether pedigree information is used, the main imputation methods proposed so far fall into two classes. One class constructs haplotypes from population linkage-disequilibrium information; the corresponding software includes FAMHAP, fastPHASE, MaCH, Beagle, PLINK, BIMBAM, IMPUTE2, and PHASEBOOK. The other class constructs haplotypes from pedigree information together with marker linkage information; the corresponding software includes Find-hap, Fimpute, AlphaImpute, and PEDIMPUTE. By algorithm, these methods can also be divided into three classes: those based on Markov chain Monte Carlo (MCMC), the parsimony method, and the expectation-maximization (EM) algorithm.

Findhap is representative of the parsimony method; it can fill in missing segments using haplotype information from either the pedigree or the population. Specifically, the haplotypes constructed from population information are first sorted by haplotype frequency, and each chromosome is divided into sections of 100 markers. For a given individual, the genotype of every marker in a section is compared with the sorted haplotypes; if no marker in the comparison is an opposite homozygote, the haplotype is taken to match in that section, the missing sites in the section are filled in as homozygotes according to the matched haplotype, and the next section is then compared. If a section matches none of the existing haplotypes, it forms a new haplotype that is appended to the end of the haplotype list, and the genotypes of all high-density sites in that section are then unknown. After this imputation based on population haplotypes is complete, the pedigree is consulted to find the haplotypes of the individual's father and mother, which are used to further correct the imputed genotypes. The parsimony method is fast but less accurate than the other algorithms. PLINK is representative of the EM algorithm. This algorithm is widely used across many fields of research and remains robust when handling populations that deviate from Hardy-Weinberg equilibrium. It consists of an E step and an M step: the E step estimates the expected values of the missing segments from the high-density marker information of the reference population; the observed and estimated information are then used for the next round of estimation, and this continues through the M step until the estimated expected values reach a stable state. The main shortcoming of this algorithm is its low accuracy.

Meanwhile as the application of high-flux sequence is gradually popularized, haplotype ratio is obtained from sequencing data and utilizes SNP chip The genotype of middle acquisition is more suitably applied to genotype to some extent and fills up algorithm.Firstly, the sequencing of two generations is polymorphic stage by stage The density ratio SNP chip test result in property site wants high more, and contains more low frequency sites.Secondly, when sequencing layer When number is lower, genotype can only be arrived by part detection, and every kind of genotype has a degree of uncertainty, this is depended on Cover the number of plies in the site.The sequencing of two generations indicates this uncertainty with genotype possibility (GL) in each site, and existing Genotype fill up in algorithm, Thunder, Beagle, Impute2 etc. can using genotype possibility as input transport It calculates.

To date, genotype calling methods based on linkage disequilibrium have not made use of the haplotype data present in second-generation sequencing reads. Second-generation sequencing covers a very large number of sites, many of them rare, and at low coverage the genotype at each site is inaccurate and can only be described by probabilities. The present invention makes full use of these properties of low-coverage second-generation sequencing and fully exploits the potential of such data, making imputation at whole-genome scale possible.

Summary of the invention

In view of this, the object of the present invention is to provide a genotype imputation method that constructs a hidden Markov chain based on rare sites, which solves the genotype imputation problem, substantially increases imputation accuracy, and to some extent improves running time.

To achieve the above object, the present invention adopts the following technical scheme:

A genotype imputation method constructing a hidden Markov chain, characterized in that it comprises the following steps:

Step S1: preprocess the genotype database using the minor alleles that the sample to be tested carries at rare sites, obtaining a preprocessed genotype database;

Step S2: define and create a localized haplotype-cluster model;

Step S3: merge nodes of the localized haplotype-cluster model whose downstream states are similar;

Step S4: prune the paths of the localized haplotype-cluster model using the edge counts of the haplotype-cluster model;

Step S5: obtain haplotype probabilities from the localized haplotype-cluster model by Monte Carlo simulation, and combine them, under Hardy-Weinberg equilibrium, with the genotype probabilities of the sample to be tested at the sites of each diplotype to obtain the diplotype posterior probabilities;

Step S6: obtain the additional weight of each constituent haplotype from the diplotype posterior probabilities and the micro-haplotypes of the sample to be tested;

Step S7: reduce the haplotype weights of the current round and add the corresponding additional weights to obtain the haplotype weights of the next round;

Step S8: repeat steps S2-S7 N times, where N is a preset value and N ≥ 10; after N cycles, obtain the genotype posterior probability of each site from the diplotype posterior probabilities.

Further, step S1 is specifically:

Step S11: perform partial typing of the haplotypes in the genotype database using the site information of the sample to be tested;

Step S12: examine the typed haplotypes using the minor alleles present at rare sites in the sample to be tested, and multiply the weight of every haplotype that carries such a rare-site allele by u, where u is a constant;
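The reweighting in step S12 can be illustrated with a minimal sketch. The function name, the dictionary layout, and the value of u below are assumptions made for illustration only; the allele coding (1 for the reference allele, 2 for the alternate allele) follows the embodiment described later in this document.

```python
from typing import Dict, Tuple

# One allele code per site: 1 = same as reference, 2 = different from reference.
Haplotype = Tuple[int, ...]


def boost_rare_site_carriers(
    weights: Dict[Haplotype, float],
    sample_rare_alleles: Dict[int, int],
    u: float,
) -> Dict[Haplotype, float]:
    """Multiply by u the weight of each haplotype that carries at least one
    minor allele the sample shows at a rare site (hypothetical helper)."""
    boosted = {}
    for hap, weight in weights.items():
        carries = any(
            hap[site] == allele for site, allele in sample_rare_alleles.items()
        )
        boosted[hap] = weight * u if carries else weight
    return boosted


# Hypothetical database of three haplotypes over four sites, all with weight 1.
weights = {(1, 1, 2, 1): 1.0, (1, 2, 1, 1): 1.0, (1, 1, 1, 1): 1.0}
# Hypothetical sample: carries minor allele 2 at the rare site with index 2.
boosted = boost_rare_site_carriers(weights, {2: 2}, u=10.0)
```

Only the first haplotype carries the sample's rare allele, so only its weight is multiplied by u; the others are left unchanged.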

Further, step S2 is specifically:

Step S21: suppose the haplotype database contains M haplotype types, the number of haplotypes of type m in the database is i_m, and no haplotype in the database has missing alleles;

Step S22: obtain the localized haplotype-cluster model from the haplotype types and their counts in the database.

Further, the localized haplotype-cluster model is a directed acyclic graph that satisfies the following four properties:

1) the graph has one root node with no incoming edges and one terminal node with no outgoing edges;

2) the graph is leveled with M+1 levels, each node having a level m; the root node has level 0 and the terminal node has level M;

3) every edge whose child node is at level m is labeled with an allele of the m-th marker; two edges leaving the same parent node cannot be labeled with the same allele;

4) for each haplotype in the database there is a path from the root node to the terminal node such that the m-th allele of the haplotype is the label of the m-th edge of the path; each edge e of the haplotype-cluster model represents the set of haplotypes whose paths from the root node to the terminal node traverse e.
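The four properties above can be made concrete with a small sketch that builds the unmerged graph (a prefix tree whose leaves are collapsed into one terminal node) from haplotype counts. The representation is an assumption made for illustration: nodes are identified by the allele prefix that leads to them, the root is the empty tuple, and the names `build_cluster_graph` and `END` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple

Haplotype = Tuple[int, ...]
END = "END"  # the single terminal node shared by all haplotypes


def build_cluster_graph(counts: Dict[Haplotype, int]):
    """Return (edge_counts, child_of) for the unmerged haplotype-cluster graph.

    edge_counts[(parent, allele)] is the number of haplotypes whose path
    traverses that edge; child_of[(parent, allele)] is the node it leads to.
    """
    n_sites = len(next(iter(counts)))
    edge_counts = defaultdict(int)
    child_of = {}
    for hap, n in counts.items():
        parent = ()  # root node, level 0
        for level, allele in enumerate(hap, start=1):
            # Every path ends in the same terminal node at the last level.
            node = END if level == n_sites else hap[:level]
            edge_counts[(parent, allele)] += n
            child_of[(parent, allele)] = node
            parent = node
    return dict(edge_counts), child_of


# Hypothetical database: three haplotype types over two sites.
counts = {(1, 1): 3, (1, 2): 2, (2, 1): 5}
edge_counts, child_of = build_cluster_graph(counts)
```

Because each edge is keyed by its parent node and allele, two edges leaving the same parent can never carry the same label, and the counts on the edges leaving the root add up to the database size.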

Further, step S3 is specifically:

Step S31: compute a score for each pair of nodes at the current level; two nodes may be merged only when the score is less than $m\,(n_x^{-1} + n_y^{-1})^{1/2} + b$;

where m and b are a scale parameter and a shift parameter;

Step S32: define each cluster of the processed localized haplotype-cluster model as a biallelic pseudo-marker, then perform a chi-square test of each marker against the corresponding trait and compute the P value;

Step S33: adjust the results of the multiple chi-square tests using a permutation test.
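The merge criterion of step S31 can be sketched as follows. The text does not define n_x and n_y or how the score is computed; the sketch assumes n_x and n_y are the haplotype counts of the two nodes being compared and takes the score as an input. The parameter values are hypothetical.

```python
import math


def merge_threshold(n_x: int, n_y: int, m: float, b: float) -> float:
    """Threshold m * (1/n_x + 1/n_y)^(1/2) + b from step S31.

    n_x and n_y are assumed here to be the haplotype counts of the two nodes.
    """
    return m * math.sqrt(1.0 / n_x + 1.0 / n_y) + b


def can_merge(score: float, n_x: int, n_y: int, m: float, b: float) -> bool:
    """Two nodes may be merged only when their score is below the threshold."""
    return score < merge_threshold(n_x, n_y, m, b)


# Hypothetical values: two nodes with 100 haplotypes each, m = 4.0, b = 0.2.
threshold = merge_threshold(100, 100, m=4.0, b=0.2)
mergeable = can_merge(0.5, 100, 100, m=4.0, b=0.2)
not_mergeable = can_merge(0.9, 100, 100, m=4.0, b=0.2)
```

The threshold shrinks toward b as the node counts grow, so well-supported nodes must be more similar before they are merged.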

Further, step S4 is specifically:

Step S41: working from front to back, compute the number of haplotypes at each node, level by level;

Step S42: for two edges leaving the same node, if the ratio of their weights exceeds 500:1, cut the edge with the smaller weight;

Step S43: remove the haplotypes corresponding to the cut edges and correct the weight of every edge of the haplotype-cluster model;
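The pruning rule of step S42 can be sketched as below. The nested-dictionary layout and the function name are assumptions for illustration; the sketch covers only the cut itself, not the recount of edge weights described in step S43.

```python
from typing import Dict, Hashable


def prune_edges(
    out_edges: Dict[Hashable, Dict[int, float]], max_ratio: float = 500.0
) -> Dict[Hashable, Dict[int, float]]:
    """For each node, drop every outgoing edge whose weight is more than
    max_ratio times smaller than the heaviest edge leaving the same node."""
    pruned = {}
    for node, branches in out_edges.items():
        heaviest = max(branches.values())
        pruned[node] = {
            allele: weight
            for allele, weight in branches.items()
            if weight * max_ratio >= heaviest
        }
    return pruned


# Hypothetical weights: node "x" has a 501:1 split, node "y" exactly 500:1.
out_edges = {"x": {1: 1002, 2: 2}, "y": {1: 500, 2: 1}}
pruned = prune_edges(out_edges)
```

In this example the light edge of node "x" is cut because the ratio exceeds 500:1, while both edges of node "y" are kept because the ratio does not exceed it.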

Further, step S5 is specifically:

Step S51: run a Monte Carlo simulation on the pruned localized haplotype-cluster model: starting from the start node, randomly select a path through to the end node, and after P simulations count the proportion of simulations that took each distinct path, which is the haplotype probability;

Step S52: take the Cartesian product of the haplotype probabilities under Hardy-Weinberg equilibrium to obtain the diplotype probabilities;

Step S53: use the genotype probabilities of the sample to be tested at the sites of each diplotype as the emission function, multiply them by the diplotype probability to obtain the relative weight of each diplotype, and obtain the diplotype posterior probability by Bayes' formula;
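Steps S51 to S53 can be sketched together. Several details are assumptions made for illustration: the graph layout (each node maps to a list of `(allele, weight, child)` branches, with `None` marking arrival at the terminal node), sampling of each branch in proportion to its weight, the genotype-likelihood table indexed by the number of alternate alleles, and all function names and numeric values.

```python
import random
from collections import Counter
from itertools import product


def sample_haplotype_probs(edges, n_sims, rng):
    """Step S51: walk from 'root' to the terminal node n_sims times, choosing
    each branch in proportion to its weight, and return path frequencies."""
    counts = Counter()
    for _ in range(n_sims):
        node, path = "root", []
        while node is not None:
            branches = edges[node]
            allele, _, node = rng.choices(
                branches, weights=[w for _, w, _ in branches]
            )[0]
            path.append(allele)
        counts[tuple(path)] += 1
    return {hap: c / n_sims for hap, c in counts.items()}


def diplotype_posteriors(hap_probs, genotype_likelihoods):
    """Steps S52-S53: prior = product of haplotype probabilities (Hardy-Weinberg),
    emission = product of per-site genotype likelihoods, then normalize.

    genotype_likelihoods[site][g] is the likelihood of the sequencing data
    given g alternate alleles (g = 0, 1, 2) at that site.
    """
    unnormalized = {}
    for (h1, p1), (h2, p2) in product(hap_probs.items(), repeat=2):
        emission = 1.0
        for site, (a1, a2) in enumerate(zip(h1, h2)):
            g = (a1 == 2) + (a2 == 2)  # allele code 2 = alternate allele
            emission *= genotype_likelihoods[site][g]
        unnormalized[(h1, h2)] = p1 * p2 * emission
    total = sum(unnormalized.values())
    return {dip: v / total for dip, v in unnormalized.items()}


# Hypothetical two-site graph with three possible haplotypes.
edges = {
    "root": [(1, 3.0, "a"), (2, 1.0, "b")],
    "a": [(1, 1.0, None), (2, 1.0, None)],
    "b": [(1, 1.0, None)],
}
hap_probs = sample_haplotype_probs(edges, n_sims=10_000, rng=random.Random(0))
# Hypothetical genotype likelihoods for the sample at the two sites.
posteriors = diplotype_posteriors(hap_probs, [[0.8, 0.15, 0.05], [0.1, 0.8, 0.1]])
```

Keeping haplotypes as the sampled states and forming diplotypes only at this last stage means the pairwise product is taken over the haplotypes that were actually sampled rather than over every pair in the reference database.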

Further, step S6 is specifically:

Step S61: assign the posterior probability of each diplotype as a partial weight to each of its constituent haplotypes, and sum all partial weights of the same haplotype to give the additional weight of that haplotype;

Step S62: compare each haplotype with the micro-haplotypes obtained from second-generation sequencing of the sample; if the haplotype contradicts them, divide its additional weight by 5;
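Step S6 can be sketched as follows. The text does not say exactly when a haplotype "contradicts" a micro-haplotype, so the rule used here is an assumption: reads are grouped by the set of sites they cover, and a haplotype is treated as contradicted if, for some group, it matches none of the micro-haplotypes observed on those sites. The data layout and function names are likewise hypothetical.

```python
from collections import defaultdict


def contradicts(hap, micro_haplotypes):
    """Assumed rule: hap is contradicted if, for some set of sites covered by
    reads, it matches none of the micro-haplotypes seen on exactly those sites.
    Each micro-haplotype is a dict {site index: allele}."""
    by_sites = defaultdict(list)
    for mh in micro_haplotypes:
        by_sites[tuple(sorted(mh))].append(mh)
    return any(
        all(any(hap[s] != mh[s] for s in sites) for mh in group)
        for sites, group in by_sites.items()
    )


def additional_weights(diplotype_post, micro_haplotypes, penalty=5.0):
    """Step S61: sum diplotype posteriors onto their constituent haplotypes.
    Step S62: divide by the penalty for haplotypes that contradict the reads."""
    extra = defaultdict(float)
    for (h1, h2), p in diplotype_post.items():
        extra[h1] += p
        extra[h2] += p
    return {
        hap: w / penalty if contradicts(hap, micro_haplotypes) else w
        for hap, w in extra.items()
    }


# Hypothetical diplotype posteriors over two sites.
diplotype_post = {
    ((1, 1), (1, 2)): 0.6,
    ((1, 1), (2, 1)): 0.3,
    ((1, 2), (2, 1)): 0.1,
}
# Hypothetical reads, both covering sites 0 and 1.
micro_haplotypes = [{0: 1, 1: 1}, {0: 1, 1: 2}]
extra = additional_weights(diplotype_post, micro_haplotypes)
```

Here haplotype (2, 1) matches neither read, so its additional weight of 0.4 is divided by 5, while (1, 1) and (1, 2) each match one read and keep their full additional weight.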

Further, step S7 is specifically:

Step S71: divide the weights of all haplotypes of the current round by 2;

Step S72: add the corresponding additional weight to each haplotype weight to give the haplotype weight for the next round;
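The weight update of step S7 amounts to halving the current weights and adding the additional weights from step S6. A minimal sketch, with hypothetical names and values:

```python
def next_round_weights(weights, extra, decay=2.0):
    """Step S71: divide every current weight by `decay` (2 in the text).
    Step S72: add each haplotype's additional weight from step S6."""
    haplotypes = set(weights) | set(extra)
    return {
        hap: weights.get(hap, 0.0) / decay + extra.get(hap, 0.0)
        for hap in haplotypes
    }


# Hypothetical current weights and additional weights.
weights = {(1, 1): 4.0, (1, 2): 2.0}
extra = {(1, 1): 0.9, (2, 1): 0.08}
updated = next_round_weights(weights, extra)
```

Because the old weight is halved every round, the contribution of an early round shrinks geometrically, so the weights are dominated by the most recent posterior estimates.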

Further, step S8 is specifically:

Step S81: if the current iteration count is less than N, return to S2;

Step S82: obtain the posterior probability of the genotype at each gene site from the posterior probabilities of the diplotypes, and give the genotype posterior probabilities as output.
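Step S82 marginalizes the diplotype posteriors down to per-site genotype posteriors. A minimal sketch, assuming biallelic sites coded 1 (reference) and 2 (alternate) and genotypes indexed by the number of alternate alleles; the function name and the numbers are hypothetical.

```python
def genotype_posteriors(diplotype_post, n_sites):
    """For each site, sum the posterior of every diplotype into the genotype
    (0, 1, or 2 alternate alleles) that the diplotype implies at that site."""
    post = [[0.0, 0.0, 0.0] for _ in range(n_sites)]
    for (h1, h2), p in diplotype_post.items():
        for site in range(n_sites):
            g = (h1[site] == 2) + (h2[site] == 2)
            post[site][g] += p
    return post


# Hypothetical diplotype posteriors over two sites.
diplotype_post = {
    ((1, 1), (1, 2)): 0.6,
    ((1, 1), (2, 1)): 0.3,
    ((1, 2), (2, 1)): 0.1,
}
site_posteriors = genotype_posteriors(diplotype_post, n_sites=2)
```

Each row of `site_posteriors` sums to one whenever the diplotype posteriors do, so the rows can be reported directly as genotype posterior probabilities.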

Compared with the prior art, the present invention has the following beneficial effects:

1. The invention uses a sample pool of big data on the scale of millions as the observed states of the hidden Markov model, which increases the accuracy of the algorithm;

2. The node set computed for the hidden Markov model consists of clusters of the markers genotyped in the target data, which reduces storage requirements and computation time;

3. The invention uses rare sites (MAF < 1%) as seeds to reduce the database from millions of haplotypes to a few hundred before imputation, which effectively reduces the running time of the algorithm exponentially without loss of precision;

4. In each iteration the invention screens the sampled haplotypes against the partial haplotype data from second-generation sequencing, which improves the accuracy of the haplotypes called from genotype data and further reduces running time.

Brief description of the drawings

Fig. 1 is a flowchart of the present invention.

Fig. 2 is a data diagram of an embodiment of the present invention.

Fig. 3 is a schematic diagram of an embodiment of the present invention.

Fig. 4 is a schematic diagram of another embodiment of the present invention.

Detailed description of the embodiments

The present invention will be further described with reference to the accompanying drawings and embodiments.

Referring to Fig. 1, the present invention provides a genotype imputation method constructing a hidden Markov chain, comprising steps S1 to S8 as set out in the summary of the invention above.

In an embodiment of the present invention, steps S1 to S8 are further specified as in steps S11 to S82 above, and the localized haplotype-cluster model is a directed acyclic graph satisfying the four properties given above.

So that those skilled in the art can better understand the technical scheme of the present invention, the invention is described in detail below with reference to the accompanying drawings.

Beagle version 4.1 can use millions of haplotypes as a reference panel, and the number of reference haplotypes compensates for the approximations in the algorithm. However, its module for the genotype-likelihood algorithm (the one applicable to low-coverage second-generation sequencing) can still only compute over diplotype states, so the computation time of that module is proportional to the square of the number of reference haplotypes, which is too slow for imputation at whole-genome scale. Compared with Beagle, the present algorithm makes the following main improvements, which make whole-genome imputation possible:

1) The states of the hidden Markov chain are haplotypes throughout, and the population probability of each haplotype is determined by the Monte Carlo method; only at the final step, when the emission function is computed, is the probability of each diplotype of the sample taken as the Cartesian product of the haplotype probabilities, which greatly reduces the number of states and the running time.

2) When the population haplotype-cluster model is initialized, the genotypes of the target sample at rare sites are compared with the reference haplotypes and the weights of haplotypes consistent with the target sample are increased, which reduces wasted exploration and speeds up iterative optimization.

3) Second-generation sequencing reads that cover multiple sites form micro-haplotypes, which are used to effectively eliminate incorrect haplotype assignments.

In an embodiment of the present invention, the present algorithm is compared with the existing Beagle algorithm and with the single-window GeneImp algorithm. The reference haplotypes are those of the 1000 Genomes Project Phase 3, comprising 2504 samples from 26 populations, and only the autosomes are imputed:

The average R² is taken as a function of MAF: each function value is obtained by averaging the R² values of the genotype probabilities of every site in the corresponding MAF bin. The values obtained by our method are very close to those of BEAGLE, whereas the average R² of the single-window GeneImp differs considerably from the other two.

The average R², running time, and resources occupied by the three methods are given in Table 1. As Table 1 shows, the R² of the single-window GeneImp differs considerably from that of the other two methods. Our algorithm, while keeping the average R² nearly unchanged, greatly shortens the running time and also halves the CPU usage relative to BEAGLE.

Table 1: Comparison of the parameters of the three algorithms

Algorithm                Average R²    Running time    CPU usage
This algorithm           0.933         8.9             16 concurrent processes
BEAGLE                   0.939         136.7           32 concurrent processes
Single-window GeneImp    0.901         1.6             16 concurrent processes

Figs. 2-4 illustrate how the haplotype clusters are optimized. Suppose there are 4 gene sites in total, where 1 denotes an allele identical to the reference genome and 2 denotes an allele different from the reference genome. The population haplotype frequency counts are shown in Fig. 2, and the haplotype cluster constructed directly from them is shown in Fig. 3, where a solid line denotes 1 and a dashed line denotes 2. Combining these with the table gives the total haplotype count of each edge: n(e_A)=311, n(e_B)=289, n(e_C)=195, n(e_D)=116, n(e_E)=289, n(e_F)=100, n(e_G)=95, n(e_H)=116, n(e_I)=137, n(e_J)=152, n(e_K)=21, n(e_L)=79, n(e_M)=95, n(e_N)=116, n(e_O)=25, n(e_P)=112, n(e_Q)=152.

Then, working backward level by level from the terminal node, nodes are merged. In the last level, node 1 can be merged with node 4 and node 2 with node 5 (counting from top to bottom). In the second-to-last level, node 1 can be merged with node 3. The merged nodes have roughly comparable branch weight ratios. As shown in Fig. 4, merging reduces the number of nodes.

Comparing Fig. 3 with Fig. 4, it is easy to see that nodes 1 and 3 of the third level are merged, which corresponds to chromosomal recombination in the population.

The above are merely preferred embodiments of the present invention; all equivalent changes and modifications made within the scope of the claims of the present invention shall fall within the scope of the present invention.

Claims

1. A genotype imputation method constructing a hidden Markov chain, characterized in that it comprises the following steps:

Step S2: define and create a localized haplotype-cluster model;

2. The genotype imputation method constructing a hidden Markov chain according to claim 1, characterized in that step S1 is specifically:

Step S12: examine the typed haplotypes using the minor alleles present at rare sites in the sample to be tested, and multiply the weight of every haplotype that carries such a rare-site allele by u, where u is a constant.

3. The genotype imputation method constructing a hidden Markov chain according to claim 1, characterized in that step S2 is specifically:

4. The genotype imputation method constructing a hidden Markov chain according to claim 3, characterized in that the localized haplotype-cluster model is a directed acyclic graph that satisfies the following four properties:

2) the graph is leveled with M+1 levels, each node having a level m; the root node has level 0 and the terminal node has level M;

5. The genotype imputation method constructing a hidden Markov chain based on rare sites according to claim 1, characterized in that step S3 is specifically:

where m and b are a scale parameter and a shift parameter;

6. The genotype imputation method constructing a hidden Markov chain according to claim 7, characterized in that step S4 is specifically:

Step S43: remove the haplotypes corresponding to the cut edges and correct the weight of every edge of the haplotype-cluster model.

7. The genotype imputation method constructing a hidden Markov chain according to claim 1, characterized in that step S5 is specifically:

Step S53: use the genotype probabilities of the sample to be tested at the sites of each diplotype as the emission function, multiply them by the diplotype probability to obtain the relative weight of each diplotype, and obtain the diplotype posterior probability by Bayes' formula.

8. The genotype imputation method constructing a hidden Markov chain according to claim 1, characterized in that step S6 is specifically:

Step S62: compare each haplotype with the micro-haplotypes obtained from second-generation sequencing of the sample; if the haplotype contradicts them, divide its additional weight by 5.

9. The genotype imputation method constructing a hidden Markov chain according to claim, characterized in that step S7 is specifically:

Step S71: divide the weights of all haplotypes of the current round by 2;

Step S72: add the corresponding additional weight to each haplotype weight to give the haplotype weight for the next round.

10. The genotype imputation method constructing a hidden Markov chain according to claim, characterized in that step S8 is specifically:

Step S81: if the current iteration count is less than N, return to S2;

Step S82: obtain the posterior probability of the genotype at each gene site from the posterior probabilities of the diplotypes, and give the genotype posterior probabilities as output.