CN109063417A - Genotype imputation method constructing a hidden Markov chain - Google Patents

Genotype imputation method constructing a hidden Markov chain

Publication number
CN109063417A
Authority
CN
China
Prior art keywords
haplotype
genotype
probability
node
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810741480.5A
Other languages
Chinese (zh)
Other versions
CN109063417B (en
Inventor
倪晟宇
包桉银
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Guomai Biotechnology Co Ltd
Original Assignee
Fujian Guomai Biotechnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Guomai Biotechnology Co Ltd
Priority to CN201810741480.5A
Publication of CN109063417A
Application granted
Publication of CN109063417B
Active
Anticipated expiration

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to a genotype imputation method that constructs a hidden Markov chain. The genotype database is pre-processed using the minor alleles that the sample carries at rare sites; a local haplotype cluster model is created; nodes of the local haplotype cluster model whose subsequent states are similar are merged; the paths of the local haplotype cluster model are pruned using the forward-backward algorithm; using the haplotype probabilities obtained by Monte Carlo simulation, the prior probabilities of the sample's diplotypes are obtained under Hardy-Weinberg equilibrium and combined with the sequencing likelihoods to obtain the diplotype posterior probabilities; after the diplotypes are screened against the sample's micro-haplotypes, the result is used as the prior for rebuilding the local haplotype cluster model in the next iteration; after N cycles, the genotype probability of each site is obtained from the diplotype posterior probabilities. The invention constructs local haplotype clusters, generates diplotype posterior probabilities from the hidden Markov model, eliminates clear contradictions using the sample's micro-haplotypes, and obtains highly accurate genotype calls after several iterations.

Description

Genotype imputation method constructing a hidden Markov chain
Technical field
The present invention relates to a genotype imputation method that constructs a hidden Markov chain.
Background art
The gene is the basic unit of heredity. The DNA (deoxyribonucleic acid) or RNA (ribonucleic acid) sequence that carries genetic information passes it to the next generation through replication and directs protein synthesis to express that information, thereby controlling the trait expression of the individual organism. Genetic testing is a technique that examines DNA from blood, other body fluids, or cells: peripheral venous blood or other tissue cells are taken from the subject, the genetic information is amplified, and dedicated equipment detects the DNA molecular information in the subject's cells in order to analyze which gene types and gene defects are present and whether their expression is normal.
In sequenced gene sequences, some sites remain undetected because sequencing depth is insufficient or site coverage is low. Genotype imputation was introduced to infer these missing sites. Genotype imputation predicts the genotypes of sites with missing data, or of untyped sites, from the genotypes of the sites that have been typed. It reduces the cost of direct typing and fills in the large numbers of missing genotypes that arise when data from different genotyping platforms are merged, which facilitates joint analysis of such data.
Certain alleles at different HLA (human leukocyte antigen) loci are often linked and inherited together, and linked genes do not form haplotypes completely at random: some genes occur together more often than expected, so certain haplotypes appear at higher frequency in the population, which results in linkage disequilibrium (LD). LD decays with the distance between sites. If this local structure is not taken into account when genotype data are phased over an extended region, sampling variation introduces error and produces apparent correlations between distant markers, which reduces the accuracy of haplotype inference. Based on linkage disequilibrium between sites, current genotype imputation falls into two main classes. One is used in GWAS (genome-wide association studies) of the general population and imputes using information from unrelated individuals; the other is mainly used in family-based GWAS. The principles of the two are broadly similar, and because sample collection is limited, most imputation methods belong to the first class.
Genotype imputation methods currently available mainly use the complete genotype information provided by a reference population to construct haplotype information for the interlinked markers, and then use that haplotype information to fill in the missing genotype segments of the target population. Depending on whether pedigree information is used, the main methods proposed so far fall into two classes. One constructs haplotypes from the linkage disequilibrium information of the population; corresponding software includes FAMHAP, fastPHASE, MaCH, Beagle, PLINK, BIMBAM, IMPUTE2 and PHASEBOOK. The other constructs haplotypes from pedigree information together with marker linkage information; corresponding software includes Find-hap, Fimpute, AlphaImpute and PEDIMPUTE. By algorithm, these methods can also be divided into three classes: Markov chain Monte Carlo (MCMC) algorithms, parsimony methods, and expectation-maximization (EM) algorithms.
Findhap is representative of the parsimony methods. It fills in missing segments using haplotype information from the pedigree or from the population. Specifically, the haplotypes built from population information are first sorted by haplotype frequency, and each chromosome is split into segments of 100 markers. For a given individual, the genotype at each marker of each segment is compared with the genotypes on the sorted haplotypes. If no marker shows opposite homozygotes, the haplotype is considered a match, the missing sites in the segment are filled according to the matched haplotype, and the next segment is compared. If a segment matches no existing haplotype, the segment forms a new haplotype that is appended to the end of the haplotype list, and all high-density sites in that segment are treated as unknown genotypes. After this population-based filling is complete, the pedigree is consulted to find the haplotypes of the individual's father and mother, which are used to further correct the individual's filled genotypes. Parsimony methods are fast but less accurate than other algorithms. PLINK is representative of the EM algorithm, which is widely used in many fields of research and remains robust when handling populations that deviate from Hardy-Weinberg equilibrium. The procedure consists of an E step and an M step. The E step estimates the expected values of the missing segments from the high-density marker information of the reference population; the observed and estimated information are then used for the next round of estimation, until the M step, that is, until the estimated expected values reach a stable state. The greatest drawback of this algorithm is its limited accuracy.
Meanwhile as the application of high-flux sequence is gradually popularized, haplotype ratio is obtained from sequencing data and utilizes SNP chip The genotype of middle acquisition is more suitably applied to genotype to some extent and fills up algorithm.Firstly, the sequencing of two generations is polymorphic stage by stage The density ratio SNP chip test result in property site wants high more, and contains more low frequency sites.Secondly, when sequencing layer When number is lower, genotype can only be arrived by part detection, and every kind of genotype has a degree of uncertainty, this is depended on Cover the number of plies in the site.The sequencing of two generations indicates this uncertainty with genotype possibility (GL) in each site, and existing Genotype fill up in algorithm, Thunder, Beagle, Impute2 etc. can using genotype possibility as input transport It calculates.
To date, typing methods based on linkage disequilibrium have not made use of the haplotype data present in NGS reads. NGS covers a very large number of sites, many of which are rare, and at low coverage the genotype at each site is inaccurate and can only be described probabilistically. The present invention makes full use of these properties of low-coverage NGS and fully exploits the potential of NGS data, making imputation at whole-genome scale possible.
Summary of the invention
In view of this, the object of the present invention is to provide a genotype imputation method that constructs a hidden Markov chain based on rare sites. It solves the genotype imputation problem, substantially increases imputation accuracy, and to some extent improves running time.
To achieve the above object, the present invention adopts the following technical solution:
A genotype imputation method constructing a hidden Markov chain, characterized by comprising the following steps:
Step S1: pre-process the genotype database using the minor alleles that the sample to be tested carries at rare sites, obtaining a pre-processed genotype database;
Step S2: define and create a local haplotype cluster model;
Step S3: merge nodes of the local haplotype cluster model whose subsequent states are similar;
Step S4: prune the paths of the local haplotype cluster model using the edge-count algorithm of the haplotype cluster model;
Step S5: obtain the haplotype probabilities of the local haplotype cluster model by Monte Carlo simulation, and combine them with Hardy-Weinberg equilibrium and the genotype probabilities of the sample to be tested at the sites of each diplotype to obtain the diplotype posterior probabilities;
Step S6: obtain the additional weight of each constituent haplotype from the diplotype posterior probabilities and the micro-haplotypes of the sample to be tested;
Step S7: reduce the current-round haplotype weights and add the corresponding additional weights to obtain the next-round haplotype weights;
Step S8: repeat steps S2-S7 N times, where N is a preset value and N >= 10; after N cycles, obtain the genotype posterior probability of each site from the diplotype posterior probabilities.
Further, step S1 is specifically:
Step S11: partially type the haplotypes of the genotype database using the site information of the sample to be tested;
Step S12: check the typed haplotypes against the minor alleles present at rare sites in the sample to be tested, and multiply by u the weight of each haplotype that carries the rare-site allele, where u is a constant;
Further, step S2 is specifically:
Step S21: assume that the haplotype database contains M haplotype types, that the count of the m-th haplotype type in the database is i_m, and that none of these haplotypes has missing alleles;
Step S22: obtain the local haplotype cluster model from the haplotype types and their counts in the database.
Further, the local haplotype cluster model is a directed acyclic graph that must satisfy the following four properties:
1) The graph has one root node with no incoming edges and one terminal node with no outgoing edges;
2) The graph is leveled with M+1 levels; each node has a level m, the root node has level 0, and the terminal node has level M;
3) Each edge whose child node is at level m is labeled with an allele of the m-th marker, and two edges leaving the same parent node cannot be labeled with the same allele;
4) For each haplotype in the database there is a path from the root node to the terminal node such that the m-th allele of the haplotype is the label of the m-th edge of the path; each edge e of the haplotype cluster graph represents the set of haplotypes whose paths from the root node to the terminal node traverse e.
Further, step S3 is specifically:
Step S31: compute a score for each pair of nodes at the current level; the two nodes may be merged only when the score is less than $m\,(n_x^{-1} + n_y^{-1})^{1/2} + b$;
where m and b are the scale parameter and the shift parameter;
Step S32: define each cluster of the processed local haplotype cluster model as a biallelic pseudo-marker, then perform a chi-square test of each marker against the corresponding trait and compute the P value;
Step S33: adjust the results of the multiple chi-square tests using a permutation test.
Further, step S4 is specifically:
Step S41: from front to back, compute the haplotype count of each node level by level;
Step S42: for two edges leaving the same node, if the ratio of their weights exceeds 500:1, cut the edge with the smaller weight;
Step S43: remove the haplotypes corresponding to the cut edges and update the weight of each edge of the haplotype cluster model;
Further, step S5 is specifically:
Step S51: run a Monte Carlo simulation on the pruned local haplotype cluster model: starting from the root node, randomly choose a path to the terminal node, and after P simulations take the proportion of times each distinct path was chosen as the haplotype probability;
Step S52: under Hardy-Weinberg equilibrium, take the Cartesian product of the haplotype probabilities to obtain the diplotype probabilities;
Step S53: use the genotype probabilities of the sample to be tested at the sites of each diplotype as the emission function, multiply them by the diplotype probability to obtain the relative weight of each diplotype, and obtain the diplotype posterior probability by Bayes' formula;
Further, step S6 is specifically:
Step S61: take the posterior probability of each diplotype as the partial weight of its constituent haplotypes, and sum all partial weights of the same haplotype to obtain the additional weight of that haplotype;
Step S62: compare each haplotype with the micro-haplotypes obtained from NGS of the sample; if the haplotype contradicts them, divide its additional weight by 5;
Further, step S7 is specifically:
Step S71: divide the weights of all haplotypes of the current round by 2;
Step S72: add the corresponding additional weight to each haplotype weight to obtain the haplotype weight for the next round;
Further, step S8 is specifically:
Step S81: if the current iteration count is less than N, return to S2.
Step S82: from the posterior probability of each diplotype, obtain the posterior probability of the genotype at each site, and output the genotype together with its posterior error probability.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses a sample pool of millions of records as the observation states of the hidden Markov model, which increases the accuracy of the algorithm;
2. The node set computed for the hidden Markov model consists of clusters of the markers genotyped in the target data, which reduces memory requirements and computation time;
3. Using rare sites (MAF < 1%) as seeds, the invention reduces the database from millions of entries to hundreds before imputation, which greatly reduces running time without loss of precision;
4. In each iteration the invention screens haplotypes against the partial haplotype data sampled from NGS reads, which improves the accuracy of the haplotypes typed from genotype data and further reduces computation time.
Brief description of the drawings
Fig. 1 is a flow chart of the invention.
Fig. 2 is a data chart for an embodiment of the invention.
Fig. 3 is a schematic diagram of an embodiment of the invention.
Fig. 4 is a schematic diagram of another embodiment of the invention.
Specific embodiments
The present invention will be further described with reference to the accompanying drawings and embodiments.
Referring to Fig. 1, the present invention provides a genotype imputation method constructing a hidden Markov chain, characterized by comprising the following steps:
Step S1: pre-process the genotype database using the minor alleles that the sample to be tested carries at rare sites, obtaining a pre-processed genotype database;
Step S2: define and create a local haplotype cluster model;
Step S3: merge nodes of the local haplotype cluster model whose subsequent states are similar;
Step S4: prune the paths of the local haplotype cluster model using the edge-count algorithm of the haplotype cluster model;
Step S5: obtain the haplotype probabilities of the local haplotype cluster model by Monte Carlo simulation, and combine them with Hardy-Weinberg equilibrium and the genotype probabilities of the sample to be tested at the sites of each diplotype to obtain the diplotype posterior probabilities;
Step S6: obtain the additional weight of each constituent haplotype from the diplotype posterior probabilities and the micro-haplotypes of the sample to be tested;
Step S7: reduce the current-round haplotype weights and add the corresponding additional weights to obtain the next-round haplotype weights;
Step S8: repeat steps S2-S7 N times, where N is a preset value and N >= 10; after N cycles, obtain the genotype posterior probability of each site from the diplotype posterior probabilities.
In an embodiment of the invention, step S1 is specifically:
Step S11: partially type the haplotypes of the genotype database using the site information of the sample to be tested;
Step S12: check the typed haplotypes against the minor alleles present at rare sites in the sample to be tested, and multiply by u the weight of each haplotype that carries the rare-site allele, where u is a constant;
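The weighting in step S12 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name, the dictionary layout, the default value of u, and the rule that carrying any one of the sample's rare-site minor alleles is enough to qualify are all hypothetical.

```python
def boost_rare_site_haplotypes(weights, haplotypes, sample_minor_alleles, u=10.0):
    """Multiply by u the weight of each haplotype that carries a rare-site
    minor allele observed in the sample (sketch of step S12).

    weights: dict haplotype id -> current weight
    haplotypes: dict haplotype id -> dict site -> allele
    sample_minor_alleles: dict rare site -> minor allele seen in the sample
    u: constant multiplier (value here is a placeholder, not from the patent)
    """
    boosted = {}
    for hap_id, weight in weights.items():
        alleles = haplotypes[hap_id]
        # Assumption: sharing any one rare-site minor allele qualifies.
        carries = any(
            alleles.get(site) == allele
            for site, allele in sample_minor_alleles.items()
        )
        boosted[hap_id] = weight * u if carries else weight
    return boosted
```

Haplotypes that do not share a rare allele keep their weight, so they remain available to later iterations at a lower priority.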
In an embodiment of the invention, further, step S2 is specifically:
Step S21: assume that the haplotype database contains M haplotype types, that the count of the m-th haplotype type in the database is i_m, and that none of these haplotypes has missing alleles;
Step S22: obtain the local haplotype cluster model from the haplotype types and their counts in the database.
In an embodiment of the invention, further, the local haplotype cluster model is a directed acyclic graph that must satisfy the following four properties:
1) The graph has one root node with no incoming edges and one terminal node with no outgoing edges;
2) The graph is leveled with M+1 levels; each node has a level m, the root node has level 0, and the terminal node has level M;
3) Each edge whose child node is at level m is labeled with an allele of the m-th marker, and two edges leaving the same parent node cannot be labeled with the same allele;
4) For each haplotype in the database there is a path from the root node to the terminal node such that the m-th allele of the haplotype is the label of the m-th edge of the path; each edge e of the haplotype cluster graph represents the set of haplotypes whose paths from the root node to the terminal node traverse e.
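As an illustration of properties 3) and 4), the sketch below builds the unmerged starting point of such a graph: a tree in which every node is identified by the allele prefix that leads to it, and every edge stores the number of haplotypes that traverse it. The representation (prefix tuples as node ids, a dictionary of edge counts) is an assumption made for this example; the single terminal node only appears once nodes are merged as in step S3.

```python
from collections import defaultdict


def build_haplotype_tree(hap_counts):
    """Count, for each edge of the unmerged haplotype tree, the haplotypes
    whose root-to-leaf path traverses it.

    hap_counts: dict mapping a haplotype (tuple of alleles, no missing data)
                to its count in the database.
    Returns a dict keyed by (level m, parent prefix, allele label).
    """
    edge_counts = defaultdict(int)
    for hap, count in hap_counts.items():
        for m, allele in enumerate(hap, start=1):
            parent = hap[: m - 1]  # node at level m-1, identified by its prefix
            # Edges from one parent are keyed by allele, so two edges leaving
            # the same parent can never share a label (property 3).
            edge_counts[(m, parent, allele)] += count
    return dict(edge_counts)
```

For example, `build_haplotype_tree({(1, 1): 3, (1, 2): 2, (2, 2): 5})` gives a count of 5 to the level-1 edge labeled 1, because both haplotypes that start with allele 1 pass through it.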
In an embodiment of the invention, further, step S3 is specifically:
Step S31: compute a score for each pair of nodes at the current level; the two nodes may be merged only when the score is less than $m\,(n_x^{-1} + n_y^{-1})^{1/2} + b$;
where m and b are the scale parameter and the shift parameter;
Step S32: define each cluster of the processed local haplotype cluster model as a biallelic pseudo-marker, then perform a chi-square test of each marker against the corresponding trait and compute the P value;
Step S33: adjust the results of the multiple chi-square tests using a permutation test.
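The merge criterion of step S31 can be written out as a small sketch. The text does not define how the score is computed or what n_x and n_y are; here the score is taken as an input, n_x and n_y are assumed to be the haplotype counts of the two candidate nodes, and the default values of m and b are placeholders.

```python
import math


def merge_threshold(n_x, n_y, m=4.0, b=0.2):
    """Threshold m * (1/n_x + 1/n_y)**0.5 + b from step S31.

    n_x, n_y: assumed to be the haplotype counts of the two nodes.
    m, b: scale and shift parameters (default values are placeholders).
    """
    return m * math.sqrt(1.0 / n_x + 1.0 / n_y) + b


def can_merge(score, n_x, n_y, m=4.0, b=0.2):
    """Two nodes may be merged only when their score is below the threshold."""
    return score < merge_threshold(n_x, n_y, m, b)
```

Because the threshold shrinks as the counts grow, well-supported nodes are merged only when their downstream states are very similar, while sparsely supported nodes are merged more readily.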
In an embodiment of the invention, further, step S4 is specifically:
Step S41: from front to back, compute the haplotype count of each node level by level;
Step S42: for two edges leaving the same node, if the ratio of their weights exceeds 500:1, cut the edge with the smaller weight;
Step S43: remove the haplotypes corresponding to the cut edges and update the weight of each edge of the haplotype cluster model;
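The pruning rule of step S42 can be sketched as below. The data layout (a dictionary from parent node to its outgoing edge weights) is an assumption for illustration; the follow-up of step S43, removing the affected haplotypes and recomputing edge weights, is left out.

```python
def prune_edges(out_edges, max_ratio=500.0):
    """Cut any outgoing edge that is outweighed by more than max_ratio:1
    by the heaviest edge leaving the same node (sketch of step S42).

    out_edges: dict parent node -> dict allele label -> edge weight
    Returns (kept edges in the same layout, list of cut (node, allele) pairs).
    """
    kept, cut = {}, []
    for node, edges in out_edges.items():
        heaviest = max(edges.values())
        kept[node] = {}
        for allele, weight in edges.items():
            if heaviest > max_ratio * weight:
                cut.append((node, allele))  # lighter edge of a lopsided pair
            else:
                kept[node][allele] = weight
    return kept, cut
```

A node with edge weights 1002 and 2 loses the lighter edge (ratio 501:1), while a node with weights 300 and 1 keeps both.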
In an embodiment of the invention, further, step S5 is specifically:
Step S51: run a Monte Carlo simulation on the pruned local haplotype cluster model: starting from the root node, randomly choose a path to the terminal node, and after P simulations take the proportion of times each distinct path was chosen as the haplotype probability;
Step S52: under Hardy-Weinberg equilibrium, take the Cartesian product of the haplotype probabilities to obtain the diplotype probabilities;
Step S53: use the genotype probabilities of the sample to be tested at the sites of each diplotype as the emission function, multiply them by the diplotype probability to obtain the relative weight of each diplotype, and obtain the diplotype posterior probability by Bayes' formula;
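Steps S51 to S53 can be sketched end to end as follows. Several details are assumptions made for this example rather than statements of the patented method: each random step is drawn in proportion to the edge weights, the graph is stored as adjacency lists, and the sample's genotype probabilities are given per site as a dictionary over unordered allele pairs.

```python
import random
from collections import Counter
from itertools import product


def sample_haplotype_probs(out_edges, root, terminal, n_sims, rng):
    """Estimate haplotype probabilities by random walks (sketch of step S51).

    out_edges: dict node -> list of (allele, child node, weight)
    rng: a random.Random instance; steps are drawn proportionally to weight,
         which is an assumption of this sketch.
    """
    counts = Counter()
    for _ in range(n_sims):
        node, path = root, []
        while node != terminal:
            edges = out_edges[node]
            allele, child, _ = rng.choices(edges, weights=[w for _, _, w in edges])[0]
            path.append(allele)
            node = child
        counts[tuple(path)] += 1
    return {hap: c / n_sims for hap, c in counts.items()}


def diplotype_posteriors(hap_probs, genotype_probs):
    """Combine Hardy-Weinberg priors with genotype probabilities (S52, S53).

    hap_probs: dict haplotype (tuple of alleles) -> probability
    genotype_probs: list over sites of dict {sorted allele pair: probability}
    """
    joint = {}
    for (h1, p1), (h2, p2) in product(hap_probs.items(), repeat=2):
        prior = p1 * p2  # Hardy-Weinberg: the two haplotypes are independent
        emission = 1.0
        for site, (a1, a2) in enumerate(zip(h1, h2)):
            emission *= genotype_probs[site].get(tuple(sorted((a1, a2))), 0.0)
        joint[(h1, h2)] = prior * emission
    total = sum(joint.values())
    # Bayes' formula: normalise the joint values so they sum to one.
    return {d: v / total for d, v in joint.items()} if total > 0 else joint
```

Keeping the sampled states as haplotypes and forming ordered pairs only at this final stage is what keeps the state space linear in the number of haplotypes during sampling.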
Further, step S6 is specifically:
Step S61: take the posterior probability of each diplotype as the partial weight of its constituent haplotypes, and sum all partial weights of the same haplotype to obtain the additional weight of that haplotype;
Step S62: compare each haplotype with the micro-haplotypes obtained from NGS of the sample; if the haplotype contradicts them, divide its additional weight by 5;
In an embodiment of the invention, further, step S7 is specifically:
Step S71: divide the weights of all haplotypes of the current round by 2;
Step S72: add the corresponding additional weight to each haplotype weight to obtain the haplotype weight for the next round;
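The weight update of steps S6 and S7 can be sketched in one function. The constants 5 and 2 come from the text; everything else is an assumption for illustration, including the `contradicts` callback that stands in for the micro-haplotype comparison and the choice to count a homozygous diplotype once for each of its two copies.

```python
def update_haplotype_weights(weights, diplotype_post, contradicts,
                             penalty=5.0, decay=2.0):
    """Compute next-round haplotype weights (sketch of steps S61 to S72).

    weights: dict haplotype -> current-round weight
    diplotype_post: dict (haplotype, haplotype) -> posterior probability
    contradicts: hypothetical callable returning True when a haplotype
                 conflicts with the sample's micro-haplotypes
    """
    extra = {}
    for (h1, h2), post in diplotype_post.items():
        for hap in (h1, h2):  # S61: each diplotype contributes to both members
            extra[hap] = extra.get(hap, 0.0) + post
    new_weights = {}
    for hap in set(weights) | set(extra):
        add = extra.get(hap, 0.0)
        if contradicts(hap):
            add /= penalty  # S62: penalise contradicted haplotypes
        # S71 and S72: halve the old weight, then add the additional weight.
        new_weights[hap] = weights.get(hap, 0.0) / decay + add
    return new_weights
```

Halving the old weights each round lets evidence from recent iterations dominate while still retaining some memory of earlier rounds.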
Further, step S8 is specifically:
Step S81: if the current iteration count is less than N, return to S2.
Step S82: from the posterior probability of each diplotype, obtain the posterior probability of the genotype at each site, and output the genotype together with its posterior error probability.
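Step S82 amounts to marginalising the diplotype posteriors site by site. The sketch below assumes diplotypes are pairs of allele tuples and that the reported error probability is one minus the posterior of the called genotype; both are illustrative assumptions.

```python
def genotype_posteriors(diplotype_post, n_sites):
    """Sum diplotype posteriors into per-site genotype posteriors.

    diplotype_post: dict (haplotype, haplotype) -> posterior probability,
                    where each haplotype is a tuple of alleles
    Returns a list over sites of dict {sorted allele pair: probability}.
    """
    result = []
    for site in range(n_sites):
        probs = {}
        for (h1, h2), post in diplotype_post.items():
            genotype = tuple(sorted((h1[site], h2[site])))
            probs[genotype] = probs.get(genotype, 0.0) + post
        result.append(probs)
    return result


def call_genotypes(site_posteriors):
    """Return, per site, the most probable genotype and its error probability."""
    calls = []
    for probs in site_posteriors:
        genotype, p = max(probs.items(), key=lambda kv: kv[1])
        calls.append((genotype, 1.0 - p))
    return calls
```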
To help those skilled in the art better understand the technical solution, the invention is described in detail below with reference to the drawings.
Beagle version 4.1 can use millions of haplotypes as a reference panel, relying on the number of reference haplotypes to compensate for the approximations in its algorithm. However, its genotype-likelihood module (used for low-coverage NGS) can still only operate on diplotype states, so the computation time of that module is proportional to the square of the number of reference haplotypes, which is too slow for imputation at whole-genome scale. Compared with Beagle, the present algorithm makes the following main improvements, which make whole-genome imputation feasible:
1) The states of the hidden Markov chain are always haplotypes, and the population probabilities of the haplotypes are determined by the Monte Carlo method. Only in the final computation of the emission function is the probability of the sample's diplotype determined as the Cartesian product of haplotype probabilities, which greatly reduces the number of states and the running time.
2) When the population haplotype cluster model is initialized, the genotypes of the target sample at rare sites are compared and the weights of the haplotypes consistent with the target sample are increased, which reduces fruitless exploration and speeds up iterative optimization.
3) NGS reads that cover multiple sites form micro-haplotypes, which effectively eliminate erroneous haplotype assignments.
In an embodiment of the invention, this algorithm is compared with the existing Beagle algorithm and with the single-window GeneImp algorithm. The reference haplotypes are from the 1000 Genomes Project Phase 3 (26 populations, 2504 samples in total), and only the autosomes are imputed:
The mean R^2 is treated as a function of MAF: each value is obtained by averaging the R^2 values of the genotype probabilities within each MAF bin. The values obtained by our method are very close to those of BEAGLE, whereas the mean R^2 of the single-window GeneImp differs considerably from the other two.
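The text does not define how R^2 is computed. A common convention, assumed here purely for illustration, is the squared Pearson correlation between the imputed allele dosage and the true genotype at a site, averaged over the sites in each MAF bin. The function names and the bin edges are hypothetical.

```python
def r_squared(dosages, truth):
    """Squared Pearson correlation between imputed dosages and true genotypes.
    Returns None when either vector has no variance."""
    n = len(dosages)
    mean_x, mean_y = sum(dosages) / n, sum(truth) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, truth))
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in truth)
    if sxx == 0 or syy == 0:
        return None
    return sxy * sxy / (sxx * syy)


def mean_r2_by_maf_bin(sites, bin_edges):
    """Average per-site R^2 within half-open MAF bins [lo, hi).

    sites: list of (maf, dosages, truth) tuples, one per site
    bin_edges: increasing list of bin boundaries (assumed, e.g. [0, 0.01, 0.5])
    """
    sums = {}
    for maf, dosages, truth in sites:
        r2 = r_squared(dosages, truth)
        if r2 is None:
            continue  # skip monomorphic sites
        for lo, hi in zip(bin_edges, bin_edges[1:]):
            if lo <= maf < hi:
                total, count = sums.get((lo, hi), (0.0, 0))
                sums[(lo, hi)] = (total + r2, count + 1)
                break
    return {b: total / count for b, (total, count) in sums.items()}
```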
The mean R2, running time and resource usage of the three methods are compared in Table 1. As Table 1 shows, the R2 of the single-window GeneImp differs considerably from that of the other two methods. Compared with BEAGLE, our algorithm substantially shortens the running time while keeping the mean R2 nearly unchanged, and the CPU used at run time is half of that used by BEAGLE.
Table 1. Parameter comparison of the three algorithms
Algorithm | Mean R2 | Running time | CPU usage
This algorithm | 0.933 | 8.9 | 16 concurrent processes
BEAGLE | 0.939 | 136.7 | 32 concurrent processes
Single-window GeneImp | 0.901 | 1.6 | 16 concurrent processes
Figs. 2-4 show how the haplotype cluster is optimized. Assume 4 sites in total, where 1 denotes an allele identical to the reference genome and 2 an allele different from the reference genome. The population haplotype frequency counts are shown in Fig. 2, and the haplotype cluster built directly from them is shown in Fig. 3, where solid lines denote 1 and dashed lines denote 2. Combining this with the table, the haplotype total of each edge is: n(e_A)=311, n(e_B)=289, n(e_C)=195, n(e_D)=116, n(e_E)=289, n(e_F)=100, n(e_G)=95, n(e_H)=116, n(e_I)=137, n(e_J)=152, n(e_K)=21, n(e_L)=79, n(e_M)=95, n(e_N)=116, n(e_O)=25, n(e_P)=112, n(e_Q)=152.
Nodes are then merged level by level, working backward from the terminal node. In the last level (counting from top to bottom), nodes 1 and 4 can be merged, as can nodes 2 and 5. In the second-to-last level, nodes 1 and 3 can be merged. Merged nodes have roughly comparable branch weight ratios. As shown in Fig. 4, merging reduces the number of nodes.
Comparing Fig. 3 with Fig. 4, it is easy to see that the merging of nodes 1 and 3 in the third level corresponds to chromosomal recombination in the population.
The above are only preferred embodiments of the present invention; all equivalent changes and modifications made within the scope of the claims of the present invention shall fall within the scope of the present invention.

Claims (10)

1. A genotype imputation method constructing a hidden Markov chain, characterized by comprising the following steps:
Step S1: pre-process the genotype database using the minor alleles that the sample to be tested carries at rare sites, obtaining a pre-processed genotype database;
Step S2: define and create a local haplotype cluster model;
Step S3: merge nodes of the local haplotype cluster model whose subsequent states are similar;
Step S4: prune the paths of the local haplotype cluster model using the edge-count algorithm of the haplotype cluster model;
Step S5: obtain the haplotype probabilities of the local haplotype cluster model by Monte Carlo simulation, and combine them with Hardy-Weinberg equilibrium and the genotype probabilities of the sample to be tested at the sites of each diplotype to obtain the diplotype posterior probabilities;
Step S6: obtain the additional weight of each constituent haplotype from the diplotype posterior probabilities and the micro-haplotypes of the sample to be tested;
Step S7: reduce the current-round haplotype weights and add the corresponding additional weights to obtain the next-round haplotype weights;
Step S8: repeat steps S2-S7 N times, where N is a preset value and N >= 10; after N cycles, obtain the genotype posterior probability of each site from the diplotype posterior probabilities.
2. The genotype imputation method constructing a hidden Markov chain according to claim 1, characterized in that step S1 is specifically:
Step S11: partially type the haplotypes of the genotype database using the site information of the sample to be tested;
Step S12: check the typed haplotypes against the minor alleles present at rare sites in the sample to be tested, and multiply by u the weight of each haplotype that carries the rare-site allele, where u is a constant.
3. The genotype imputation method constructing a hidden Markov chain according to claim 1, characterized in that step S2 is specifically:
Step S21: assume that the haplotype database contains M haplotype types, that the count of the m-th haplotype type in the database is i_m, and that none of these haplotypes has missing alleles;
Step S22: obtain the local haplotype cluster model from the haplotype types and their counts in the database.
4. The genotype imputation method constructing a hidden Markov chain according to claim 3, characterized in that the local haplotype cluster model is a directed acyclic graph that must satisfy the following four properties:
1) The graph has one root node with no incoming edges and one terminal node with no outgoing edges;
2) The graph is leveled with M+1 levels; each node has a level m, the root node has level 0, and the terminal node has level M;
3) Each edge whose child node is at level m is labeled with an allele of the m-th marker, and two edges leaving the same parent node cannot be labeled with the same allele;
4) For each haplotype in the database there is a path from the root node to the terminal node such that the m-th allele of the haplotype is the label of the m-th edge of the path; each edge e of the haplotype cluster graph represents the set of haplotypes whose paths from the root node to the terminal node traverse e.
5. The genotype filling method for constructing a hidden Markov chain based on rare sites according to claim 1, characterized in that step S3 specifically comprises:
Step S31: compute a score for each pair of nodes at the current level; two nodes may be merged only when the score is less than m (n_x^{-1} + n_y^{-1})^{1/2} + b,
where m and b are a scale parameter and a shift parameter, respectively;
Step S32: define each cluster of the processed local haplotype clustering model as a binary pseudo-marker, then perform a chi-square test of each marker against the corresponding trait and compute the P value;
Step S33: adjust the results of the multiple chi-square tests using a permutation test.
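The merge criterion of step S31 reduces to a single comparison, sketched below. The claim does not say how the score is computed or what n_x and n_y denote, so the score is taken as an input and n_x, n_y are assumed to be the haplotype counts of the two nodes; the default values of `m` and `b` are placeholders, not values from the claim.

```python
import math


def merge_threshold(n_x, n_y, m=4.0, b=0.2):
    """Threshold m * (1/n_x + 1/n_y)**0.5 + b from step S31."""
    return m * math.sqrt(1.0 / n_x + 1.0 / n_y) + b


def can_merge(score, n_x, n_y, m=4.0, b=0.2):
    """Two nodes may merge only if their score is below the threshold."""
    return score < merge_threshold(n_x, n_y, m, b)


print(can_merge(0.5, 100, 100))  # well-populated nodes, small score
print(can_merge(1.0, 100, 100))  # same nodes, larger score
print(can_merge(1.0, 4, 4))      # sparsely populated nodes
```

Because the threshold shrinks as the node counts grow, nodes supported by many haplotypes must be more similar before they are merged than nodes supported by few.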
6. The genotype filling method for constructing a hidden Markov chain according to claim 7, characterized in that step S4 specifically comprises:
Step S41: working from front to back, compute level by level the number of haplotypes at each node;
Step S42: for two edges leaving the same node, if the ratio of their weights exceeds 500:1, cut the edge with the smaller weight;
Step S43: remove the haplotypes corresponding to the cut edges and correct the weight of every edge of the haplotype clustering model.
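A minimal sketch of the pruning rule in step S42, assuming the outgoing edges of each node are stored as a hypothetical mapping `{node: {allele: weight}}`. It only cuts edges; the removal of the affected haplotypes and the re-weighting of step S43 are not shown.

```python
def prune_edges(out_edges, max_ratio=500.0):
    """Drop every outgoing edge whose weight is more than max_ratio times
    smaller than the heaviest edge leaving the same node."""
    pruned = {}
    for node, edges in out_edges.items():
        heaviest = max(edges.values())
        # w * max_ratio >= heaviest avoids dividing by a zero weight.
        pruned[node] = {allele: w for allele, w in edges.items()
                        if w * max_ratio >= heaviest}
    return pruned


edges = {"a": {"0": 1000.0, "1": 1.0},   # ratio 1000:1, edge "1" is cut
         "b": {"0": 300.0, "1": 1.0}}    # ratio 300:1, both edges kept
print(prune_edges(edges))
```

A ratio of exactly 500:1 is kept in this sketch, since the claim cuts only when the ratio exceeds 500:1.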
7. The genotype filling method for constructing a hidden Markov chain according to claim 1, characterized in that step S5 specifically comprises:
Step S51: perform Monte Carlo simulation on the pruned local haplotype clustering model; specifically, starting from the start node, randomly select a path to the terminal node, and after P simulations compute the proportion of times each distinct path was visited, which is the haplotype probability;
Step S52: take the Cartesian product of the haplotype probabilities under Hardy-Weinberg equilibrium to obtain the diplotype probabilities;
Step S53: use the genotype probabilities of the sample to be tested at the sites of the diplotype as the emission function, multiply them by the diplotype probability to obtain the relative proportion of each diplotype, and obtain the diplotype posterior probability by Bayes' formula.
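Steps S51 to S53 can be sketched as below for diplotypes (the "double type" of the claim). Several points are assumptions for illustration: the graph is stored as `{node: {allele: (child, weight)}}`, each random step picks an edge with probability proportional to its weight (the claim only says the path is chosen at random), diplotypes are treated as ordered haplotype pairs, and `emission` is a hypothetical callable returning the sequencing likelihood of the sample given a pair.

```python
import random
from collections import Counter
from itertools import product


def sample_haplotype_probs(out_edges, root, end, n_sims, rng):
    """S51: random walks from root to end; the visit frequency of each
    distinct path is used as the haplotype probability."""
    visits = Counter()
    for _ in range(n_sims):
        node, path = root, []
        while node != end:
            alleles = list(out_edges[node])
            weights = [out_edges[node][a][1] for a in alleles]
            allele = rng.choices(alleles, weights=weights)[0]
            path.append(allele)
            node = out_edges[node][allele][0]
        visits["".join(path)] += 1
    return {hap: c / n_sims for hap, c in visits.items()}


def diplotype_posteriors(hap_probs, emission):
    """S52-S53: prior P(h1) * P(h2) under Hardy-Weinberg equilibrium,
    multiplied by the emission likelihood and normalised (Bayes)."""
    joint = {(h1, h2): p1 * p2 * emission(h1, h2)
             for (h1, p1), (h2, p2) in product(hap_probs.items(), repeat=2)}
    total = sum(joint.values())
    return {pair: value / total for pair, value in joint.items()}


graph = {"r": {"0": ("e", 3.0), "1": ("e", 1.0)}}
probs = sample_haplotype_probs(graph, "r", "e", 1000, random.Random(0))
post = diplotype_posteriors(probs, lambda h1, h2: 1.0 if h1 != h2 else 0.1)
print(probs, post)
```

The normalisation in `diplotype_posteriors` assumes at least one pair has non-zero likelihood; a production version would need to handle the all-zero case.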
8. The genotype filling method for constructing a hidden Markov chain according to claim 1, characterized in that step S6 specifically comprises:
Step S61: take the posterior probability of each diplotype as the partial weight of the haplotypes that form it, and sum the partial weights of all identical haplotypes to give the additional weight of that haplotype;
Step S62: compare each haplotype with the micro-haplotypes obtained by next-generation sequencing of the sample; if the haplotype contradicts them, divide its additional weight by 5.
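Steps S61 and S62 amount to an accumulation followed by a penalty, sketched below. The sketch assumes diplotype posteriors keyed by ordered haplotype pairs, lets a homozygous pair contribute its posterior twice to the same haplotype, and uses a hypothetical predicate `contradicts` in place of the actual comparison against the sample's micro-haplotypes.

```python
from collections import defaultdict


def additional_weights(diplotype_post, contradicts, penalty=5.0):
    """S61: each diplotype posterior is a partial weight for both of its
    haplotypes; partial weights of identical haplotypes are summed.
    S62: haplotypes contradicting the micro-haplotypes are divided by 5."""
    extra = defaultdict(float)
    for (h1, h2), p in diplotype_post.items():
        extra[h1] += p
        extra[h2] += p
    return {h: (w / penalty if contradicts(h) else w)
            for h, w in extra.items()}


post = {("00", "01"): 0.75, ("01", "01"): 0.25}
print(additional_weights(post, contradicts=lambda h: h == "00"))
```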
9. The genotype filling method for constructing a hidden Markov chain according to claim, characterized in that step S7 specifically comprises:
Step S71: divide the weights of all haplotypes from the current round by 2;
Step S72: add each haplotype's weight to its corresponding additional weight to give the haplotype weight for the next round.
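The weight update of steps S71 and S72 is a one-line rule; the sketch below assumes both the current weights and the additional weights are dictionaries keyed by haplotype, and that a haplotype with no additional weight receives zero.

```python
def update_weights(weights, extra):
    """S71: halve every current haplotype weight.
    S72: add the haplotype's additional weight for the next round."""
    return {h: w / 2.0 + extra.get(h, 0.0) for h, w in weights.items()}


print(update_weights({"00": 4.0, "01": 2.0}, {"00": 0.5}))
```

Halving on every round means a weight assigned k rounds earlier contributes only 1/2^k of its original value, so recent posterior evidence dominates the haplotype weights as the iterations proceed.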
10. The genotype filling method for constructing a hidden Markov chain according to claim, characterized in that step S8 specifically comprises:
Step S81: if the current iteration count is less than N, return to S2;
Step S82: from the posterior probability of each diplotype, obtain the posterior probability of the genotype at each gene site, and provide the genotype posterior probabilities as output.
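Step S82 marginalises the diplotype posteriors site by site. The sketch below is one way to do this under assumptions: diplotypes are ordered pairs of equal-length allele strings, and a genotype at a site is represented as the sorted, unphased pair of alleles.

```python
from collections import defaultdict


def genotype_posteriors(diplotype_post, n_sites):
    """For each site, sum the posterior of every diplotype that implies
    the same unphased genotype at that site."""
    per_site = []
    for site in range(n_sites):
        probs = defaultdict(float)
        for (h1, h2), p in diplotype_post.items():
            genotype = tuple(sorted((h1[site], h2[site])))
            probs[genotype] += p
        per_site.append(dict(probs))
    return per_site


post = {("00", "01"): 0.75, ("01", "01"): 0.25}
for site, probs in enumerate(genotype_posteriors(post, 2)):
    print(site, probs)
```

Because the diplotype posteriors sum to one, the genotype posteriors at each site also sum to one.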
CN201810741480.5A 2018-07-09 2018-07-09 Genotype filling method for constructing hidden Markov chain Active CN109063417B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810741480.5A CN109063417B (en) 2018-07-09 2018-07-09 Genotype filling method for constructing hidden Markov chain

Publications (2)

Publication Number Publication Date
CN109063417A true CN109063417A (en) 2018-12-21
CN109063417B CN109063417B (en) 2022-03-15

Family

ID=64819541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810741480.5A Active CN109063417B (en) 2018-07-09 2018-07-09 Genotype filling method for constructing hidden Markov chain

Country Status (1)

Country Link
CN (1) CN109063417B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030211501A1 (en) * 2001-04-18 2003-11-13 Stephens J. Claiborne Method and system for determining haplotypes from a collection of polymorphisms
US20070184467A1 (en) * 2005-11-26 2007-08-09 Matthew Rabinowitz System and method for cleaning noisy genetic data from target individuals using genetic data from genetically related individuals
CN103745136A (en) * 2013-12-26 2014-04-23 中国农业大学 Efficient haplotype inference and deleted genotype fill method
CN104830967A (en) * 2015-03-10 2015-08-12 中国农业科学院作物科学研究所 Positioning method of rice selected introgression lines QTL
US20160217250A1 (en) * 2015-01-27 2016-07-28 Institut Pasteur Identifying molecular systems in protein sequence data
CN107577918A (en) * 2017-08-22 2018-01-12 山东师范大学 The recognition methods of CpG islands, device based on genetic algorithm and hidden Markov model
EP3343416A1 (en) * 2016-12-27 2018-07-04 Tata Consultancy Services Limited System and method for improved estimation of functional potential of genomes and metagenomes

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHARON R. BROWNING et al.: "Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering", 《ARTICLE》 *
周正奎: "全基因组关联分析和全基因组预测法解析犬髋关节疾病", 《中国博士学位论文全文数据库 农业科技辑》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111445953A (en) * 2020-03-27 2020-07-24 武汉古奥基因科技有限公司 Method for splitting tetraploid fish subgenome by using whole genome comparison
CN111445953B (en) * 2020-03-27 2022-04-26 武汉古奥基因科技有限公司 Method for splitting tetraploid fish subgenome by using whole genome comparison
CN114420205A (en) * 2021-01-29 2022-04-29 杭州联川基因诊断技术有限公司 High-throughput micro-haplotype detection and typing system and method based on next generation sequencing
CN112885408A (en) * 2021-02-22 2021-06-01 中国农业大学 Method and device for detecting SNP marker locus based on low-depth sequencing
WO2023246949A1 (en) * 2022-06-24 2023-12-28 厦门万基生物科技有限公司 Non-invasive method for determining parentage before birth by using microhaplotypes
CN117711488A (en) * 2023-11-29 2024-03-15 东莞博奥木华基因科技有限公司 Gene haplotype detection method based on long-reading long-sequencing and application thereof

Also Published As

Publication number Publication date
CN109063417B (en) 2022-03-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant