CN109063417A - A genotype imputation method that constructs a hidden Markov chain - Google Patents
- Publication number: CN109063417A
- Application number: CN201810741480.5A
- Authority: CN (China)
- Prior art keywords: haplotype, genotype, probability, node, weight
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Abstract
The invention relates to a genotype imputation method that constructs a hidden Markov chain. The genotype database is pre-processed using the minor alleles at rare sites carried by the sample; a local haplotype cluster model is created; nodes of the model whose subsequent states are similar are merged; the paths of the model are pruned using the forward-backward algorithm; the haplotype probabilities obtained by Monte Carlo simulation are combined under Hardy-Weinberg equilibrium to give the prior probabilities of the sample's diplotypes, which are then combined with the sequencing likelihoods to give the diplotype posterior probabilities; the diplotype posterior probabilities, after screening against the sample's micro-haplotypes, are used to rebuild the local haplotype cluster model in the next iteration; and after N rounds the genotype probabilities at the corresponding sites are obtained from the diplotype posterior probabilities. The invention builds local haplotype clusters, generates diplotype posterior probabilities from a hidden Markov model, and eliminates clear contradictions using the sample's micro-haplotypes, so that repeated iteration yields highly accurate typing.
Description
Technical field
The invention relates to a genotype imputation method that constructs a hidden Markov chain.
Background art
A gene is the basic unit of heredity: a DNA (deoxyribonucleic acid) or RNA (ribonucleic acid) sequence that carries genetic information, passes it to the next generation through replication, and directs protein synthesis to express that information, thereby controlling the traits of an individual organism. Genetic testing is a technique that examines DNA in blood, other body fluids, or cells. Peripheral venous blood or other tissue cells are taken from the subject, the genetic material is amplified, and dedicated equipment is used to examine the DNA in the subject's cells, analysing which gene types it contains, whether there are gene defects, and whether gene expression is normal.
In sequenced gene data, some sites still go undetected because sequencing depth is insufficient or site coverage is low. Genotype imputation was introduced to infer these missing sites. It predicts the genotypes of missing or untyped sites from the genotypes of sites that have been typed. Genotype imputation reduces the cost of direct genotyping, and by filling in the large numbers of genotypes that go missing when data from different genotyping platforms are merged, it supports joint analysis of those data.
Certain alleles at different HLA (human leukocyte antigen) loci are often linked and inherited together, and linked genes do not form haplotypes completely at random. Some genes occur together more often than expected, so certain haplotypes appear at higher frequency in the population, which produces linkage disequilibrium (LD). LD decays with the distance between sites. If this is not taken into account when genotype data are phased over an extended region, sampling variation introduces error and produces apparent correlations between distant markers, which lowers the accuracy of haplotype inference. Based on linkage disequilibrium between sites, current genotype imputation methods fall into two broad classes. One is used in GWAS (genome-wide association studies) of general populations and imputes from the information of unrelated individuals; the other is used mainly in family-based GWAS. The principles of the two are broadly similar, and because sample collection is limited, most imputation methods belong to the first class.
Current genotype imputation methods mainly use the complete genotype information provided by a reference population to construct haplotype information for linked markers, and then use that haplotype information to fill in the missing genotype segments of the target population. Depending on whether family information is used, the main imputation methods proposed so far fall into two classes. One class constructs haplotypes from the linkage disequilibrium information of the population; the corresponding software includes FAMHAP, fastPHASE, MaCH, Beagle, PLINK, BIMBAM, IMPUTE2 and PHASEBOOK. The other class constructs haplotypes from family information and marker linkage information; the corresponding software includes Findhap, Fimpute, AlphaImpute and PEDIMPUTE. By algorithm, these methods can also be divided into three classes: Markov chain (MCMC) algorithms, parsimony methods, and the expectation-maximization (EM) algorithm.
Findhap is representative of the parsimony approach. It fills in missing segments using haplotype information from the pedigree or from the population. Specifically, the haplotypes constructed from population information are first sorted by haplotype frequency, and each chromosome is divided into sections of 100 markers. For a given individual, the genotype at each marker of each section is compared with the sorted haplotypes. If no marker in the comparison is an opposing homozygote, the haplotype is taken as a match for that section, and the missing sites in the section are filled in according to the matched haplotype; the next section is then compared. If a section matches none of the haplotypes, the section forms a new haplotype, which is appended to the end of the haplotype list, and all high-density sites in that section are left as unknown genotypes. Once this population-based imputation is complete, the pedigree is consulted: the haplotypes of the individual's father and mother are found and used to further correct the imputed genotypes. Parsimony is fast, but its accuracy is lower than that of other algorithms.
PLINK is representative of the EM algorithm, which is widely used across many fields and remains robust for populations that deviate from Hardy-Weinberg equilibrium. The procedure consists of an E step and an M step. The E step estimates the expected values of the missing segments from the high-density marker information of the reference population; the observed and estimated information are then used for the next round of expectation estimation, until, in the M step, the estimated expectation reaches a stable state. The main drawback of this algorithm is its limited accuracy.
Meanwhile as the application of high-flux sequence is gradually popularized, haplotype ratio is obtained from sequencing data and utilizes SNP chip
The genotype of middle acquisition is more suitably applied to genotype to some extent and fills up algorithm.Firstly, the sequencing of two generations is polymorphic stage by stage
The density ratio SNP chip test result in property site wants high more, and contains more low frequency sites.Secondly, when sequencing layer
When number is lower, genotype can only be arrived by part detection, and every kind of genotype has a degree of uncertainty, this is depended on
Cover the number of plies in the site.The sequencing of two generations indicates this uncertainty with genotype possibility (GL) in each site, and existing
Genotype fill up in algorithm, Thunder, Beagle, Impute2 etc. can using genotype possibility as input transport
It calculates.
To date, typing methods based on linkage disequilibrium have not made use of the haplotype data present in second-generation sequencing. Second-generation sequencing yields a very large number of sites, many of them rare, and at low depth the genotype at each site is inaccurate and can only be described by probabilities. The invention makes full use of these properties of low-depth second-generation sequencing and exploits the potential of the sequencing data, making genome-wide imputation possible.
Summary of the invention
In view of this, the object of the invention is to provide a genotype imputation method that constructs a hidden Markov chain based on rare sites, which solves the genotype imputation problem, substantially improves imputation accuracy and, to some extent, improves running time.
To achieve this object, the invention adopts the following technical solution:
A genotype imputation method that constructs a hidden Markov chain, characterized by comprising the following steps:
Step S1: pre-process the genotype database using the minor alleles at rare sites carried by the sample to be tested, obtaining a pre-processed genotype database;
Step S2: define and create a local haplotype cluster model;
Step S3: merge nodes of the local haplotype cluster model whose subsequent states are similar;
Step S4: prune the paths of the local haplotype cluster model using the edge-count algorithm of the haplotype cluster model;
Step S5: obtain the haplotype probabilities of the local haplotype cluster model by Monte Carlo simulation, and combine them with Hardy-Weinberg equilibrium and the genotype probabilities of the sample to be tested at the sites corresponding to each diplotype to obtain the diplotype posterior probabilities;
Step S6: from the diplotype posterior probabilities and the micro-haplotypes of the sample to be tested, obtain the additional weight of each constituent haplotype;
Step S7: reduce the current-round haplotype weights and add the corresponding additional weights to obtain the next-round haplotype weights;
Step S8: repeat steps S2-S7 N times, where N is a preset value and N >= 10; after N rounds, obtain the diplotype posterior probabilities and from them the genotype posterior probabilities at the corresponding sites.
Further, step S1 is specifically:
Step S11: perform local typing of the haplotypes in the genotype database using the site information of the sample to be tested;
Step S12: examine the typed haplotypes against the minor alleles present at rare sites in the sample to be tested, and multiply the weight of each haplotype that carries such a rare-site allele by u, where u is a constant;
Further, step S2 is specifically:
Step S21: assume that the haplotype database contains M haplotype types, that the count of the m-th haplotype type in the database is i_m, and that these haplotypes have no missing alleles in the database;
Step S22: from the haplotype types and their counts in the database, obtain the local haplotype cluster model.
Further, the local haplotype cluster model is a directed acyclic graph that must satisfy the following four properties:
1) the graph has one root node with no incoming edges and one terminal node with no outgoing edges;
2) the graph is levelled into M+1 levels, and every node has a level m; the root node has level 0 and the terminal node has level M;
3) every edge whose child node is at level m is labelled with an allele of the m-th marker, and two edges leaving the same parent node cannot be labelled with the same allele;
4) for every haplotype in the database there is a path from the root node to the terminal node such that the m-th allele of the haplotype is the label of the m-th edge of the path; each edge e of the haplotype cluster represents the set of haplotypes whose path from the root node to the terminal node traverses e.
Further, step S3 is specifically:
Step S31: compute a score for each pair of nodes at the current level; two nodes may be merged only when the score is less than $m\,(n_x^{-1} + n_y^{-1})^{1/2} + b$, where m and b are the scale parameter and the shift parameter;
Step S32: define each cluster of the processed local haplotype cluster model as a biallelic pseudo-marker, then perform a chi-square test of each marker against the corresponding trait and compute the P value;
Step S33: adjust the results of the multiple chi-square tests using a permutation test.
Further, step S4 is specifically:
Step S41: from front to back, level by level, count the number of haplotypes at each node;
Step S42: for two edges leaving the same node, if the ratio of their weights exceeds 500:1, cut the edge with the smaller weight;
Step S43: remove the haplotypes corresponding to the cut edges and correct the weights of the edges of the haplotype cluster model;
Further, step S5 is specifically:
Step S51: run a Monte Carlo simulation on the pruned local haplotype cluster model: starting from the start node, randomly select a path through to the terminal node, and after P simulations take the proportion of times each distinct path was selected as the haplotype probability;
Step S52: under Hardy-Weinberg equilibrium, take the Cartesian product of the haplotype probabilities to obtain the diplotype probabilities;
Step S53: use the genotype probabilities of the sample to be tested at the sites of each diplotype as the emission function, multiply them by the diplotype probability to obtain the relative weight of each diplotype, and obtain the diplotype posterior probabilities by Bayes' formula;
Further, step S6 is specifically:
Step S61: assign the posterior probability of each diplotype to its constituent haplotypes as a partial weight, and sum the partial weights of each identical haplotype to obtain that haplotype's additional weight;
Step S62: compare each haplotype with the micro-haplotypes obtained from second-generation sequencing of the sample, and if the haplotype contradicts them, divide its additional weight by 5;
Further, step S7 is specifically:
Step S71: divide the weights of all current-round haplotypes by 2;
Step S72: add the corresponding additional weight to each haplotype weight to obtain the haplotype weight for the next round;
Further, step S8 is specifically:
Step S81: if the current iteration count is less than N, return to S2;
Step S82: from the posterior probability of each diplotype, obtain the posterior probability of the genotype at each site, and output the genotype together with its posterior error probability.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses a sample pool on the scale of millions of entries as the observation states of the hidden Markov model, which improves the accuracy of the algorithm;
2. The node set of the hidden Markov model is computed as the clusters of markers genotyped in the target data, which reduces memory requirements and computation time;
3. Using rare sites (MAF < 1%) as seeds, the invention reduces the database from millions of entries to a few hundred before imputation, effectively reducing the running time of the algorithm exponentially without loss of precision;
4. In each iteration the invention screens the sampled haplotypes using the partial haplotype data from second-generation sequencing, which improves the accuracy of the haplotypes typed from genotype data and further reduces computation time.
Brief description of the drawings
Fig. 1 is a flow chart of the invention.
Fig. 2 is a data diagram for one embodiment of the invention.
Fig. 3 is a schematic diagram of one embodiment of the invention.
Fig. 4 is a schematic diagram of another embodiment of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments.
Referring to Fig. 1, the invention provides a genotype imputation method that constructs a hidden Markov chain, characterized by comprising the following steps:
Step S1: pre-process the genotype database using the minor alleles at rare sites carried by the sample to be tested, obtaining a pre-processed genotype database;
Step S2: define and create a local haplotype cluster model;
Step S3: merge nodes of the local haplotype cluster model whose subsequent states are similar;
Step S4: prune the paths of the local haplotype cluster model using the edge-count algorithm of the haplotype cluster model;
Step S5: obtain the haplotype probabilities of the local haplotype cluster model by Monte Carlo simulation, and combine them with Hardy-Weinberg equilibrium and the genotype probabilities of the sample to be tested at the sites corresponding to each diplotype to obtain the diplotype posterior probabilities;
Step S6: from the diplotype posterior probabilities and the micro-haplotypes of the sample to be tested, obtain the additional weight of each constituent haplotype;
Step S7: reduce the current-round haplotype weights and add the corresponding additional weights to obtain the next-round haplotype weights;
Step S8: repeat steps S2-S7 N times, where N is a preset value and N >= 10; after N rounds, obtain the diplotype posterior probabilities and from them the genotype posterior probabilities at the corresponding sites.
In one embodiment of the invention, step S1 is specifically:
Step S11: perform local typing of the haplotypes in the genotype database using the site information of the sample to be tested;
Step S12: examine the typed haplotypes against the minor alleles present at rare sites in the sample to be tested, and multiply the weight of each haplotype that carries such a rare-site allele by u, where u is a constant;
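Purely as an illustration, the re-weighting of step S12 might be sketched in Python as follows. The dictionary representation of a haplotype, the function name `weight_by_rare_alleles` and the value u = 10 are assumptions made for this sketch; the description states only that u is a constant, and "carries such a rare-site allele" is read here as carrying at least one of them.

```python
def weight_by_rare_alleles(haplotypes, weights, sample_rare, u=10.0):
    """Multiply by u the weight of each haplotype carrying a rare minor allele of the sample.

    haplotypes  : list of dicts mapping site name -> allele (hypothetical layout)
    weights     : list of current haplotype weights, same order as haplotypes
    sample_rare : dict mapping rare site -> minor allele seen in the sample
    u           : constant multiplier (value assumed for illustration)
    """
    new_weights = []
    for hap, w in zip(haplotypes, weights):
        # A haplotype "carries" the rare allele if it matches at any rare site.
        carries = any(hap.get(site) == allele for site, allele in sample_rare.items())
        new_weights.append(w * u if carries else w)
    return new_weights


haps = [{"s1": 2, "s2": 1}, {"s1": 1, "s2": 1}, {"s1": 1, "s2": 2}]
new_w = weight_by_rare_alleles(haps, [1.0, 1.0, 1.0], {"s1": 2}, u=10.0)
```

Only the first haplotype shares the sample's minor allele at the rare site `s1`, so only its weight is scaled.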
In one embodiment of the invention, further, step S2 is specifically:
Step S21: assume that the haplotype database contains M haplotype types, that the count of the m-th haplotype type in the database is i_m, and that these haplotypes have no missing alleles in the database;
Step S22: from the haplotype types and their counts in the database, obtain the local haplotype cluster model.
In one embodiment of the invention, further, the local haplotype cluster model is a directed acyclic graph that must satisfy the following four properties:
1) the graph has one root node with no incoming edges and one terminal node with no outgoing edges;
2) the graph is levelled into M+1 levels, and every node has a level m; the root node has level 0 and the terminal node has level M;
3) every edge whose child node is at level m is labelled with an allele of the m-th marker, and two edges leaving the same parent node cannot be labelled with the same allele;
4) for every haplotype in the database there is a path from the root node to the terminal node such that the m-th allele of the haplotype is the label of the m-th edge of the path; each edge e of the haplotype cluster represents the set of haplotypes whose path from the root node to the terminal node traverses e.
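A minimal Python sketch of the unmerged starting graph follows, under the assumption that each node can be identified by the allele prefix leading to it and that the number of levels equals the number of markers, as properties 2) to 4) imply. The function name and the tuple layout are hypothetical; merging of nodes (including the merge of all last-level nodes into the single terminal node) is left to step S3.

```python
from collections import defaultdict


def build_cluster_tree(hap_counts):
    """Count the haplotypes that traverse each edge of the unmerged cluster tree.

    hap_counts : dict mapping a haplotype (tuple of alleles, one per marker)
                 to its count in the database.
    Returns a dict mapping (level m, parent prefix, allele) -> haplotype count,
    where the parent prefix is the tuple of the first m-1 alleles.
    """
    edge_count = defaultdict(int)
    for hap, n in hap_counts.items():
        for m, allele in enumerate(hap, start=1):
            # The edge into level m is labelled with the m-th allele (property 3).
            edge_count[(m, hap[:m - 1], allele)] += n
    return dict(edge_count)


counts = {(1, 1, 1): 3, (1, 2, 1): 2, (2, 2, 2): 5}
edges = build_cluster_tree(counts)
```

Because the allele is part of the edge key, two edges leaving the same parent can never carry the same label, and each edge count is exactly the size of the haplotype set described in property 4).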
In one embodiment of the invention, further, step S3 is specifically:
Step S31: compute a score for each pair of nodes at the current level; two nodes may be merged only when the score is less than $m\,(n_x^{-1} + n_y^{-1})^{1/2} + b$, where m and b are the scale parameter and the shift parameter;
Step S32: define each cluster of the processed local haplotype cluster model as a biallelic pseudo-marker, then perform a chi-square test of each marker against the corresponding trait and compute the P value;
Step S33: adjust the results of the multiple chi-square tests using a permutation test.
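The merge test of step S31 might be sketched as below. The description does not define the score or the symbols n_x and n_y, so this sketch assumes that n_x and n_y are the haplotype counts of the two nodes and that the score is the largest difference between the two nodes' downstream haplotype proportions; the function names and the parameter values used in the example are likewise illustrative only.

```python
import math


def similarity_score(down_x, down_y):
    """Assumed score: largest absolute difference in downstream haplotype proportions.

    down_x, down_y : dicts mapping a downstream haplotype to its count at each node.
    """
    nx, ny = sum(down_x.values()), sum(down_y.values())
    keys = set(down_x) | set(down_y)
    return max(abs(down_x.get(k, 0) / nx - down_y.get(k, 0) / ny) for k in keys)


def can_merge(down_x, down_y, m, b):
    """Return True when score < m * (1/n_x + 1/n_y)**0.5 + b."""
    nx, ny = sum(down_x.values()), sum(down_y.values())
    threshold = m * math.sqrt(1.0 / nx + 1.0 / ny) + b
    return similarity_score(down_x, down_y) < threshold


x = {"11": 50, "12": 50}
y = {"11": 48, "12": 52}
z = {"11": 100}
merge_xy = can_merge(x, y, m=1.0, b=0.0)  # proportions differ by 0.02
merge_xz = can_merge(x, z, m=1.0, b=0.0)  # proportions differ by 0.5
```

The threshold shrinks as the node counts grow, so well-supported nodes are merged only when their downstream distributions are very close.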
In one embodiment of the invention, further, step S4 is specifically:
Step S41: from front to back, level by level, count the number of haplotypes at each node;
Step S42: for two edges leaving the same node, if the ratio of their weights exceeds 500:1, cut the edge with the smaller weight;
Step S43: remove the haplotypes corresponding to the cut edges and correct the weights of the edges of the haplotype cluster model;
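The 500:1 rule of step S42 can be illustrated with a short Python sketch. The nested-dictionary layout and the function name are assumptions, and the follow-up of step S43 (removing the affected haplotypes and recomputing edge weights) is not shown.

```python
def prune_edges(out_edges, max_ratio=500.0):
    """Drop an outgoing edge when a sibling edge outweighs it by more than max_ratio.

    out_edges : dict mapping node -> {allele label: edge weight}
    Returns a new dict with the light edges removed.
    """
    pruned = {}
    for node, edges in out_edges.items():
        heaviest = max(edges.values())
        # Keep an edge unless heaviest / weight strictly exceeds max_ratio.
        pruned[node] = {a: w for a, w in edges.items() if w * max_ratio >= heaviest}
    return pruned


graph = {"n0": {1: 1000.0, 2: 1.0}, "n1": {1: 500.0, 2: 1.0}}
kept = prune_edges(graph)
```

In the example, the light edge at `n0` is cut because 1000:1 exceeds the limit, while both edges at `n1` survive because 500:1 does not exceed it.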
In one embodiment of the invention, further, step S5 is specifically:
Step S51: run a Monte Carlo simulation on the pruned local haplotype cluster model: starting from the start node, randomly select a path through to the terminal node, and after P simulations take the proportion of times each distinct path was selected as the haplotype probability;
Step S52: under Hardy-Weinberg equilibrium, take the Cartesian product of the haplotype probabilities to obtain the diplotype probabilities;
Step S53: use the genotype probabilities of the sample to be tested at the sites of each diplotype as the emission function, multiply them by the diplotype probability to obtain the relative weight of each diplotype, and obtain the diplotype posterior probabilities by Bayes' formula;
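Steps S51 to S53 can be sketched in Python as follows. The graph layout, the function names, and the assumptions that edges are chosen in proportion to their weights and that alleles are coded 1 (reference) and 2 (alternate) are choices made for this illustration, not details confirmed by the description.

```python
import random
from collections import Counter
from itertools import product


def sample_haplotype_probs(out_edges, root, terminal, n_sims, rng):
    """S51: random walks from root to terminal; path frequencies estimate haplotype probabilities.

    out_edges : dict mapping node -> list of (allele, child node, weight)
    """
    tally = Counter()
    for _ in range(n_sims):
        node, path = root, []
        while node != terminal:
            alleles, children, weights = zip(*out_edges[node])
            i = rng.choices(range(len(alleles)), weights=weights)[0]
            path.append(alleles[i])
            node = children[i]
        tally[tuple(path)] += 1
    return {h: c / n_sims for h, c in tally.items()}


def make_emission(gl):
    """gl[s][k]: sample genotype probability at site s for k copies of allele 2."""
    def emission(h1, h2):
        p = 1.0
        for s, (a, b) in enumerate(zip(h1, h2)):
            p *= gl[s][(a == 2) + (b == 2)]
        return p
    return emission


def diplotype_posteriors(hap_probs, emission):
    """S52-S53: prior = product of haplotype probabilities; posterior by Bayes' formula."""
    joint = {(h1, h2): p1 * p2 * emission(h1, h2)
             for (h1, p1), (h2, p2) in product(hap_probs.items(), repeat=2)}
    total = sum(joint.values())  # assumes at least one diplotype has non-zero likelihood
    return {d: v / total for d, v in joint.items()}


toy_graph = {"r": [(1, "a", 3.0), (2, "b", 1.0)],
             "a": [(1, "t", 1.0)],
             "b": [(2, "t", 1.0)]}
probs = sample_haplotype_probs(toy_graph, "r", "t", 1000, random.Random(0))
post = diplotype_posteriors({(1,): 0.5, (2,): 0.5}, make_emission([[0.0, 0.0, 1.0]]))
```

Keeping the states as haplotypes and forming diplotypes only at this final step is what avoids carrying a quadratic number of diplotype states through the model.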
Further, step S6 is specifically:
Step S61: assign the posterior probability of each diplotype to its constituent haplotypes as a partial weight, and sum the partial weights of each identical haplotype to obtain that haplotype's additional weight;
Step S62: compare each haplotype with the micro-haplotypes obtained from second-generation sequencing of the sample, and if the haplotype contradicts them, divide its additional weight by 5;
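A hedged Python sketch of steps S61 and S62 follows. The description does not spell out what counts as a contradiction, so the predicate used here (a haplotype that disagrees with every observed micro-haplotype) is an assumption, as are the function names and the choice to credit a homozygous diplotype's posterior to its haplotype twice.

```python
from collections import defaultdict


def contradicts_all(hap, micro_haps):
    """Assumed rule: hap contradicts the reads if it mismatches every micro-haplotype.

    micro_haps : list of dicts mapping site index -> allele observed on one read.
    """
    return all(any(hap[s] != a for s, a in mh.items()) for mh in micro_haps)


def additional_weights(dip_post, micro_haps, penalty=5.0):
    """S61: sum diplotype posteriors per constituent haplotype; S62: divide by 5 on contradiction."""
    extra = defaultdict(float)
    for (h1, h2), p in dip_post.items():
        extra[h1] += p
        extra[h2] += p
    return {h: (w / penalty if contradicts_all(h, micro_haps) else w)
            for h, w in extra.items()}


A, B = (1, 1), (1, 2)
dip_post = {(A, A): 0.5, (A, B): 0.25, (B, A): 0.25}
extra = additional_weights(dip_post, [{0: 1, 1: 1}])
```

Haplotype `A` agrees with the read-derived micro-haplotype and keeps its summed weight, while `B` disagrees at the second site and has its additional weight divided by 5.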
In one embodiment of the invention, further, step S7 is specifically:
Step S71: divide the weights of all current-round haplotypes by 2;
Step S72: add the corresponding additional weight to each haplotype weight to obtain the haplotype weight for the next round;
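The update of steps S71 and S72 amounts to halving the old weight and adding the new evidence, which a few lines of Python can illustrate. The function name and the dictionary layout are hypothetical.

```python
def next_round_weights(weights, extra, decay=2.0):
    """S71: divide current weights by 2; S72: add each haplotype's additional weight.

    weights : dict mapping haplotype -> current-round weight
    extra   : dict mapping haplotype -> additional weight from step S6
    """
    haps = set(weights) | set(extra)
    return {h: weights.get(h, 0.0) / decay + extra.get(h, 0.0) for h in haps}


updated = next_round_weights({"A": 4.0, "B": 2.0}, {"A": 1.5})
```

Because old weights are halved every round, a haplotype that stops receiving additional weight fades out geometrically over the N iterations.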
Further, step S8 is specifically:
Step S81: if the current iteration count is less than N, return to S2;
Step S82: from the posterior probability of each diplotype, obtain the posterior probability of the genotype at each site, and output the genotype together with its posterior error probability.
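As an illustration of step S82, the per-site genotype posteriors can be obtained by marginalizing the diplotype posteriors. The sketch below assumes biallelic sites with alleles coded 1 and 2 and reports, for each site, the most probable genotype (as the number of copies of allele 2) together with one minus its posterior as the error probability; the function names are hypothetical.

```python
def genotype_posteriors(dip_post, n_sites):
    """Sum diplotype posteriors into per-site probabilities of 0, 1 or 2 copies of allele 2."""
    post = [[0.0, 0.0, 0.0] for _ in range(n_sites)]
    for (h1, h2), p in dip_post.items():
        for s in range(n_sites):
            post[s][(h1[s] == 2) + (h2[s] == 2)] += p
    return post


def call_genotypes(post):
    """Return (most probable genotype, posterior error probability) for each site."""
    calls = []
    for probs in post:
        k = max(range(3), key=probs.__getitem__)
        calls.append((k, 1.0 - probs[k]))
    return calls


dip = {((1, 2), (1, 2)): 0.75, ((1, 2), (2, 2)): 0.25}
site_post = genotype_posteriors(dip, n_sites=2)
calls = call_genotypes(site_post)
```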
To help those skilled in the art better understand the technical solution, the invention is described in detail below with reference to the drawings.
Beagle version 4.1 can use millions of haplotypes as the reference panel, relying on the number of reference haplotypes to compensate for approximations in the algorithm. However, its genotype-likelihood module (the one applied to low-depth second-generation sequencing) can only work with diplotype states, so the computation time of that module is proportional to the square of the number of reference haplotypes, which is too slow for genome-wide imputation. Compared with Beagle, the present algorithm makes the following main improvements, which make genome-wide imputation feasible:
1) The states of the hidden Markov chain are haplotypes throughout, and the population probabilities of the haplotypes are determined by the Monte Carlo method. Only in the final step, when the emission function is computed, is the probability of each sample diplotype taken as the Cartesian product of the haplotype probabilities, which greatly reduces the number of states and the running time.
2) When the population haplotype cluster model is initialized, the genotypes of the target sample at rare sites are compared with the haplotypes and the weights of haplotypes consistent with the target sample are increased, which reduces wasted exploration and speeds up iterative optimization.
3) Second-generation sequencing reads that cover multiple sites form micro-haplotypes, which are used to eliminate erroneous haplotype assignments effectively.
In one embodiment of the invention, the present algorithm is compared with the existing Beagle algorithm and with the single-window GeneImp algorithm. The reference haplotypes are from phase 3 of the 1000 Genomes Project, covering 26 populations and 2504 samples in total, and only the autosomes are imputed.
The mean R^2 is taken as a function of MAF: each value is obtained by averaging the R^2 of the genotype probabilities over the sites in each MAF bin. The values obtained by our method are very close to those of BEAGLE, while the mean R^2 of single-window GeneImp differs considerably from the other two.
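One way such a per-bin average might be computed is sketched below. The description does not define R^2, so the sketch assumes the common convention of the squared Pearson correlation between true and imputed allele dosages at a site; the function names, the bin edges and the toy data are purely illustrative and are not results from the patent.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def mean_r2_by_maf_bin(sites, edges):
    """Average R^2 over the sites falling in each half-open MAF bin [lo, hi).

    sites : list of (maf, true_dosages, imputed_dosages)
    edges : ascending list of bin boundaries
    """
    bins = {}
    for maf, truth, imputed in sites:
        for lo, hi in zip(edges, edges[1:]):
            if lo <= maf < hi:
                bins.setdefault((lo, hi), []).append(squared_correlation(truth, imputed))
    return {b: sum(v) / len(v) for b, v in bins.items()}


toy_sites = [(0.005, [0, 1, 2, 0], [0, 1, 2, 0]),
             (0.2, [0, 1, 2, 1], [0.1, 0.9, 1.8, 1.2])]
r2_bins = mean_r2_by_maf_bin(toy_sites, [0.0, 0.01, 0.5])
```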
The mean R^2, running time and CPU usage of the three methods are given in Table 1. As Table 1 shows, the R^2 of single-window GeneImp differs considerably from that of the other two methods. Our algorithm keeps the mean R^2 nearly unchanged while substantially shortening the running time, and the CPU occupied at run time is also reduced to half of that used by BEAGLE.
Table 1. Comparison of the parameters of the three algorithms
Algorithm | Mean R^2 | Running time | CPU occupied |
This algorithm | 0.933 | 8.9 | 16 concurrent processes |
BEAGLE | 0.939 | 136.7 | 32 concurrent processes |
Single-window GeneImp | 0.901 | 1.6 | 16 concurrent processes |
Figs. 2-4 show how the haplotype cluster is optimized. Assume 4 sites in total, where 1 denotes the same allele as the reference genome and 2 denotes a different allele. The population haplotype frequencies are given in Fig. 2, and the haplotype cluster built directly from them is shown in Fig. 3, where a solid line denotes allele 1 and a dashed line denotes allele 2. Counting the haplotypes on each edge from the table gives n(e_A)=311, n(e_B)=289, n(e_C)=195, n(e_D)=116, n(e_E)=289, n(e_F)=100, n(e_G)=95, n(e_H)=116, n(e_I)=137, n(e_J)=152, n(e_K)=21, n(e_L)=79, n(e_M)=95, n(e_N)=116, n(e_O)=25, n(e_P)=112, n(e_Q)=152.
Nodes are then merged level by level, working backwards from the terminal node. In the last level (counting from top to bottom), node 1 can be merged with node 4, and node 2 with node 5. In the second-to-last level, node 1 can be merged with node 3. The merged nodes have roughly comparable weight ratios on their outgoing branches. As shown in Fig. 4, merging reduces the number of nodes.
Comparing Fig. 3 with Fig. 4, it is easy to see that the merging of nodes 1 and 3 in the third level corresponds to chromosomal recombination in the population.
The above are only preferred embodiments of the invention. All equivalent changes and modifications made within the scope of the claims of the invention fall within the scope of the invention.
Claims (10)
1. A genotype imputation method that constructs a hidden Markov chain, characterized by comprising the following steps:
Step S1: pre-process the genotype database using the minor alleles at rare sites carried by the sample to be tested, obtaining a pre-processed genotype database;
Step S2: define and create a local haplotype cluster model;
Step S3: merge nodes of the local haplotype cluster model whose subsequent states are similar;
Step S4: prune the paths of the local haplotype cluster model using the edge-count algorithm of the haplotype cluster model;
Step S5: obtain the haplotype probabilities of the local haplotype cluster model by Monte Carlo simulation, and combine them with Hardy-Weinberg equilibrium and the genotype probabilities of the sample to be tested at the sites corresponding to each diplotype to obtain the diplotype posterior probabilities;
Step S6: from the diplotype posterior probabilities and the micro-haplotypes of the sample to be tested, obtain the additional weight of each constituent haplotype;
Step S7: reduce the current-round haplotype weights and add the corresponding additional weights to obtain the next-round haplotype weights;
Step S8: repeat steps S2-S7 N times, where N is a preset value and N >= 10; after N rounds, obtain the diplotype posterior probabilities and from them the genotype posterior probabilities at the corresponding sites.
2. The genotype imputation method that constructs a hidden Markov chain according to claim 1, characterized in that step S1 is specifically:
Step S11: perform local typing of the haplotypes in the genotype database using the site information of the sample to be tested;
Step S12: examine the typed haplotypes against the minor alleles present at rare sites in the sample to be tested, and multiply the weight of each haplotype that carries such a rare-site allele by u, where u is a constant.
3. The genotype imputation method that constructs a hidden Markov chain according to claim 1, characterized in that step S2 is specifically:
Step S21: assume that the haplotype database contains M haplotype types, that the count of the m-th haplotype type in the database is i_m, and that these haplotypes have no missing alleles in the database;
Step S22: from the haplotype types and their counts in the database, obtain the local haplotype cluster model.
4. The genotype imputation method that constructs a hidden Markov chain according to claim 3, characterized in that the local haplotype cluster model is a directed acyclic graph that must satisfy the following four properties:
1) the graph has one root node with no incoming edges and one terminal node with no outgoing edges;
2) the graph is levelled into M+1 levels, and every node has a level m; the root node has level 0 and the terminal node has level M;
3) every edge whose child node is at level m is labelled with an allele of the m-th marker, and two edges leaving the same parent node cannot be labelled with the same allele;
4) for every haplotype in the database there is a path from the root node to the terminal node such that the m-th allele of the haplotype is the label of the m-th edge of the path; each edge e of the haplotype cluster represents the set of haplotypes whose path from the root node to the terminal node traverses e.
5. a kind of genotype complementing method based on rare site construction hidden Markov chain according to claim 1,
Be characterized in that: the step S3 is specially;
Step S31: calculating score to each pair of node of current level, when score is less than m (nx-1 +ny-1) 1/2+b when, two section
Point could merge;
Wherein, m and b is scale parameter and conversion parameter;
Step S32: each cluster for the local haplotype clustering model handled well is defined as biserial puppet label, then to each label
Corresponding trait carries out Chi-square Test and calculates probability P value
Step S33: the result of multiple Chi-square Tests is adjusted using permutation test.
6. The genotype complementing method for constructing a hidden Markov chain according to claim 7, characterized in that step S4 specifically comprises:
Step S41: from front to back, compute level by level the number of haplotypes at each node;
Step S42: for two edges leaving the same node, if the ratio of their weights exceeds 500:1, prune the edge with the smaller weight;
Step S43: remove the haplotypes corresponding to the pruned edges and correct the weight of every edge of the haplotype clustering model.
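A minimal sketch of this pruning pass is given below, assuming the unmerged prefix representation (a node is identified by the allele prefix leading to it) and haplotypes coded as equal-length strings. The function name `prune` and the `max_ratio` parameter are hypothetical; only the 500:1 threshold comes from the claim.

```python
from collections import defaultdict

def prune(hap_weights, max_ratio=500.0):
    """Return the haplotypes (with weights) that survive edge pruning.

    hap_weights maps a non-empty set of equal-length haplotype strings
    to positive weights.
    """
    surviving = dict(hap_weights)
    n_markers = len(next(iter(surviving)))
    for m in range(n_markers):               # S41: front to back, level by level
        edge_weight = defaultdict(float)     # (prefix node, allele) -> weight
        for hap, w in surviving.items():
            edge_weight[(hap[:m], hap[m])] += w
        heaviest = defaultdict(float)        # prefix node -> heaviest outgoing edge
        for (node, _), w in edge_weight.items():
            heaviest[node] = max(heaviest[node], w)
        # S42: cut an edge when a sibling edge outweighs it by more than 500:1.
        cut = {edge for edge, w in edge_weight.items()
               if heaviest[edge[0]] > max_ratio * w}
        # S43: drop the haplotypes on cut edges; the weights of later levels
        # are recomputed from the survivors on the next loop iteration.
        surviving = {hap: w for hap, w in surviving.items()
                     if (hap[:m], hap[m]) not in cut}
    return surviving
```

For example, `prune({"00": 1000.0, "01": 1.0, "10": 5.0})` keeps `"00"` and `"10"`: at the second marker the two edges leaving node `"0"` have weights 1000 and 1, which exceeds the 500:1 ratio.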
7. The genotype complementing method for constructing a hidden Markov chain according to claim 1, characterized in that step S5 specifically comprises:
Step S51: perform Monte Carlo simulation on the pruned local haplotype clustering model: starting from the start node, randomly select a path to the terminal node, and after P simulations count the proportion of each distinct path, which is the haplotype probability;
Step S52: take the Cartesian product of the haplotype probabilities under Hardy-Weinberg equilibrium to obtain the diplotype (double type) probabilities;
Step S53: take the genotype probabilities of the sample under test at the sites covered by the diplotype as the emission function, multiply them by the diplotype probability to obtain the relative weight of each diplotype, and obtain the posterior probability of the diplotype by Bayes' formula.
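The three steps can be sketched in Python as below. The sketch makes several assumptions that the claim does not state: paths are sampled by choosing each outgoing edge in proportion to its weight, the graph is the unmerged prefix representation, alleles are coded `"0"`/`"1"`, and `genotype_probs[m][g]` holds the sequencing likelihood of alternate-allele count g at site m. Function names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from itertools import product

def sample_haplotype_probs(hap_weights, n_sim, seed=0):
    """S51: estimate haplotype probabilities from n_sim random root-to-end paths."""
    rng = random.Random(seed)
    n_markers = len(next(iter(hap_weights)))
    edge_weight = defaultdict(lambda: defaultdict(float))  # prefix -> allele -> weight
    for hap, w in hap_weights.items():
        for m in range(n_markers):
            edge_weight[hap[:m]][hap[m]] += w
    counts = Counter()
    for _ in range(n_sim):
        path = ""
        for _ in range(n_markers):
            alleles = sorted(edge_weight[path])
            weights = [edge_weight[path][a] for a in alleles]
            path += rng.choices(alleles, weights=weights)[0]
        counts[path] += 1
    return {hap: c / n_sim for hap, c in counts.items()}

def diplotype_posteriors(hap_probs, genotype_probs):
    """S52-S53: Hardy-Weinberg prior times emission, normalized by Bayes' formula."""
    unnorm = {}
    for (h1, p1), (h2, p2) in product(hap_probs.items(), repeat=2):
        prior = p1 * p2  # two haplotypes drawn independently (ordered pair)
        emission = 1.0
        for m, (a1, a2) in enumerate(zip(h1, h2)):
            emission *= genotype_probs[m][int(a1) + int(a2)]
        unnorm[(h1, h2)] = prior * emission
    total = sum(unnorm.values())  # assumed > 0: some diplotype fits the data
    return {d: v / total for d, v in unnorm.items()}
```

Because the Cartesian product yields ordered pairs, a heterozygous diplotype appears twice, once as (h1, h2) and once as (h2, h1), so its total prior is the familiar 2·p1·p2.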
8. The genotype complementing method for constructing a hidden Markov chain according to claim 1, characterized in that step S6 specifically comprises:
Step S61: take the posterior probability of each diplotype as the partial weight of the haplotypes that compose it, and sum the partial weights of all identical haplotypes to obtain the additional weight of that haplotype;
Step S62: compare each haplotype with the micro-haplotypes obtained from next-generation sequencing of the sample; if the haplotype contradicts them, divide its additional weight by 5.
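A hedged sketch of this weighting step follows. The claim does not define what it means for a haplotype to contradict the micro-haplotypes, so the sketch assumes that `micro_haplotypes` maps a site window `(start, end)` to the set of allele strings observed in reads over that window, and that a haplotype contradicts the data when it matches none of them in some window. It also assumes a homozygous diplotype contributes its posterior twice. All names are hypothetical.

```python
from collections import defaultdict

def contradicts(hap, micro_haplotypes):
    """True if, in some window, hap matches none of the observed micro-haplotypes."""
    return any(hap[start:end] not in observed
               for (start, end), observed in micro_haplotypes.items())

def additional_weights(diplotype_post, micro_haplotypes, penalty=5.0):
    """S61-S62: sum diplotype posteriors per haplotype, then penalize contradictions."""
    extra = defaultdict(float)
    for (h1, h2), post in diplotype_post.items():
        extra[h1] += post  # partial weight for the first constituent haplotype
        extra[h2] += post  # and for the second (same key if homozygous)
    for hap in extra:
        if contradicts(hap, micro_haplotypes):
            extra[hap] /= penalty  # the claim fixes this divisor at 5
    return dict(extra)
```

With posteriors `{("00", "01"): 0.6, ("01", "01"): 0.4}` and reads that only support `"00"` over sites 0 to 1, haplotype `"01"` accumulates 1.4 and is then reduced to 0.28.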
9. The genotype complementing method for constructing a hidden Markov chain according to claim, characterized in that step S7 specifically comprises:
Step S71: divide the weights of all haplotypes of the current round by 2;
Step S72: add each haplotype weight to its corresponding additional weight to obtain the haplotype weight for the next round.
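The update in steps S71 and S72 amounts to an exponentially decaying average, which can be written in a few lines of Python. The function name `next_round_weights` is hypothetical, and treating a haplotype missing from either mapping as having weight 0 is an assumption.

```python
def next_round_weights(weights, extra, decay=2.0):
    """S71-S72: halve the current weights, then add the additional weights."""
    haps = set(weights) | set(extra)
    return {h: weights.get(h, 0.0) / decay + extra.get(h, 0.0) for h in haps}

updated = next_round_weights({"00": 4.0, "01": 2.0}, {"01": 0.5, "11": 0.25})
```

Halving the old weight each round means that evidence from k rounds ago is discounted by a factor of 2^k, so later iterations are dominated by the most recent posterior estimates.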
10. The genotype complementing method for constructing a hidden Markov chain according to claim, characterized in that step S8 specifically comprises:
Step S81: if the current number of iterations is less than N, return to S2;
Step S82: from the posterior probability of each diplotype, obtain the posterior probability of the genotype at each locus, and give the posterior error probability of the genotype as output.
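Step S82 is a marginalization over diplotypes, sketched below under stated assumptions: alleles are coded `"0"`/`"1"`, a genotype is the alternate-allele count (0, 1, or 2), and the posterior error probability of a call is taken to be one minus the posterior of the called genotype. The function names are hypothetical.

```python
def genotype_posteriors(diplotype_post, n_markers):
    """Sum diplotype posteriors into per-site posteriors for genotypes 0, 1, 2."""
    site_post = [[0.0, 0.0, 0.0] for _ in range(n_markers)]
    for (h1, h2), p in diplotype_post.items():
        for m in range(n_markers):
            site_post[m][int(h1[m]) + int(h2[m])] += p
    return site_post

def call_genotypes(site_post):
    """Return (most probable genotype, posterior error probability) per site."""
    calls = []
    for probs in site_post:
        best = max(range(3), key=lambda g: probs[g])
        calls.append((best, 1.0 - probs[best]))
    return calls

post = {("00", "01"): 0.75, ("01", "01"): 0.25}
calls = call_genotypes(genotype_posteriors(post, n_markers=2))
```

In this toy input both diplotypes are homozygous reference at the first site, so that call has error probability 0, while the second site is called heterozygous with error probability 0.25.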
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810741480.5A CN109063417B (en) | 2018-07-09 | 2018-07-09 | Genotype filling method for constructing hidden Markov chain |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109063417A (en) | 2018-12-21 |
CN109063417B (en) | 2022-03-15 |
Family ID: 64819541
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810741480.5A Active CN109063417B (en) | 2018-07-09 | 2018-07-09 | Genotype filling method for constructing hidden Markov chain |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109063417B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030211501A1 (en) * | 2001-04-18 | 2003-11-13 | Stephens J. Claiborne | Method and system for determining haplotypes from a collection of polymorphisms |
US20070184467A1 (en) * | 2005-11-26 | 2007-08-09 | Matthew Rabinowitz | System and method for cleaning noisy genetic data from target individuals using genetic data from genetically related individuals |
CN103745136A (en) * | 2013-12-26 | 2014-04-23 | 中国农业大学 | Efficient haplotype inference and deleted genotype fill method |
CN104830967A (en) * | 2015-03-10 | 2015-08-12 | 中国农业科学院作物科学研究所 | Positioning method of rice selected introgression lines QTL |
US20160217250A1 (en) * | 2015-01-27 | 2016-07-28 | Institut Pasteur | Identifying molecular systems in protein sequence data |
CN107577918A (en) * | 2017-08-22 | 2018-01-12 | 山东师范大学 | The recognition methods of CpG islands, device based on genetic algorithm and hidden Markov model |
EP3343416A1 (en) * | 2016-12-27 | 2018-07-04 | Tata Consultancy Services Limited | System and method for improved estimation of functional potential of genomes and metagenomes |
Non-Patent Citations (2)
Title |
---|
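SHARON R. BROWNING et al.: "Rapid and Accurate Haplotype Phasing and Missing-Data Inference for Whole-Genome Association Studies By Use of Localized Haplotype Clustering", 《ARTICLE》 * |
ZHOU ZHENGKUI: "Dissecting canine hip joint disease by genome-wide association analysis and genome-wide prediction", China Doctoral Dissertations Full-text Database, Agricultural Science and Technology Series * |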
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111445953A (en) * | 2020-03-27 | 2020-07-24 | 武汉古奥基因科技有限公司 | Method for splitting tetraploid fish subgenome by using whole genome comparison |
CN111445953B (en) * | 2020-03-27 | 2022-04-26 | 武汉古奥基因科技有限公司 | Method for splitting tetraploid fish subgenome by using whole genome comparison |
CN114420205A (en) * | 2021-01-29 | 2022-04-29 | 杭州联川基因诊断技术有限公司 | High-throughput micro-haplotype detection and typing system and method based on next generation sequencing |
CN112885408A (en) * | 2021-02-22 | 2021-06-01 | 中国农业大学 | Method and device for detecting SNP marker locus based on low-depth sequencing |
WO2023246949A1 (en) * | 2022-06-24 | 2023-12-28 | 厦门万基生物科技有限公司 | Non-invasive method for determining parentage before birth by using microhaplotypes |
CN117711488A (en) * | 2023-11-29 | 2024-03-15 | 东莞博奥木华基因科技有限公司 | Gene haplotype detection method based on long-reading long-sequencing and application thereof |
Also Published As
Publication number | Publication date |
---|---|
CN109063417B (en) | 2022-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Dawson et al. | A Bayesian approach to the identification of panmictic populations and the assignment of individuals | |
CN109063417A (en) | A kind of genotype complementing method constructing hidden Markov chain | |
Thompson | Identity by descent: variation in meiosis, across genomes, and in populations | |
Vernot et al. | Reconciliation with non-binary species trees | |
François et al. | Spatially explicit Bayesian clustering models in population genetics | |
Molinaro et al. | Tree-based multivariate regression and density estimation with right-censored data | |
Thomas et al. | Sibship reconstruction in hierarchical population structures using Markov chain Monte Carlo techniques | |
Keller et al. | Recent admixture generates heterozygosity–fitness correlations during the range expansion of an invading species | |
Sousa et al. | Identifying loci under selection against gene flow in isolation-with-migration models | |
Myers et al. | Biogeographic barriers, Pleistocene refugia, and climatic gradients in the southeastern Nearctic drive diversification in cornsnakes (Pantherophis guttatus complex) | |
Wang et al. | Efficient estimation of realized kinship from single nucleotide polymorphism genotypes | |
Medina et al. | Estimating the timing of multiple admixture pulses during local ancestry inference | |
CN107025384A (en) | A kind of construction method of complex data forecast model | |
Dyer | The gstudio package | |
Banker et al. | Hierarchical hybrid enrichment: multitiered genomic data collection across evolutionary scales, with application to chorus frogs (Pseudacris) | |
US20040106113A1 (en) | Prediction of estrogen receptor status of breast tumors using binary prediction tree modeling | |
DeSaix et al. | Population assignment from genotype likelihoods for low‐coverage whole‐genome sequencing data | |
Banka et al. | Evolutionary biclustering of gene expressions | |
Hem et al. | Robust modeling of additive and nonadditive variation with intuitive inclusion of expert knowledge | |
Slatyer et al. | Do different rates of gene flow underlie variation in phenotypic and phenological clines in a montane grasshopper community? | |
Ko et al. | Joint estimation of pedigrees and effective population size using Markov chain Monte Carlo | |
Zhu et al. | Fast variance component analysis using large-scale ancestral recombination graphs | |
Zarn | Genomic Inference of Inbreeding in Alexander Archipelago Wolves (Canis lupus ligoni) on Prince of Wales Island, Southeast Alaska | |
KR20210080766A (en) | Method And System For Constructing Cancer Patient Specific Gene Networks And Finding Prognostic Gene Pairs | |
Cardin et al. | Joint association testing of common and rare genetic variants using hierarchical modeling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||