CN109273052A - Genome haploid assembly method and device
- Publication number: CN109273052A (application CN201811069322.6A)
- Authority: CN (China)
- Prior art keywords: genome, information, SNP, data, haploid
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
Embodiments of the present invention provide a genome haploid assembly method and device. The method comprises: obtaining SNP information of a reference genome; obtaining phasing information for the SNP information using PacBio data and Hi-C data respectively; obtaining integrated phasing information for the whole genome from the phasing information; and, according to the integrated phasing information, distinguishing the haploid origin of the sequencing reads of the PacBio data and performing haploid genome assembly for each haplotype separately. The device performs the above method. By using the integrated phasing information to distinguish the haploid origin of the PacBio reads and assembling each haplotype separately, the method and device provided by the embodiments achieve haploid assembly of the whole genome and improve the level of refinement of the assembled genome sequence.
Description
Technical field
Embodiments of the present invention relate to the technical field of genetic engineering, and in particular to a genome haploid assembly method and device.
Background
With the continuous development of sequencing technology, genome assembly technology has also improved steadily. From second-generation to third-generation sequencing, as sequencing reads have grown longer, genome assembly results have become better and better. Current genome assembly relies on three main classes of algorithm: de Bruijn graph (DBG), overlap-layout-consensus (OLC), and string graph. Whichever method is used, however, for a diploid genome it assembles only a single genome of haploid size and does not distinguish between homologous chromosomes. This is mainly a limitation of current technology: homologous chromosomes are highly similar, and sequencing reads are not long enough to span the segments that are identical between homologous chromosomes, so the phase relationship between the differences on either side of such a segment cannot be determined.
Because of this, many software tools currently perform phasing on the SNPs (single nucleotide polymorphisms) between homologous chromosomes to determine their phase relationship, for example SNPHap, SHAPEIT, WhatsHap, and HapCut. Of these, WhatsHap can phase SNPs using third-generation PacBio data, which works much better than second-generation data: the haplotype blocks obtained by phasing are markedly longer and fewer in number. Even so, reads still cannot span certain homozygous regions. HapCut can phase SNPs using Hi-C data and distinguish haplotypes at the chromosome level, but because of the principle of Hi-C sequencing, the phased SNPs are sparsely distributed across the genome. Falcon, an assembler for third-generation PacBio data, has a downstream analysis tool, Falcon-unzip, which uses variation information such as SNPs and SVs identified during assembly to reassemble haplotypes in the regions of the assembly that contain variation, and it works well for the genomes of some species. However, it is likewise limited by read length: like WhatsHap, it assembles a series of separate haploid blocks, and the phase relationship between blocks still cannot be determined.
Therefore, how to avoid the above drawbacks, achieve haploid assembly of the whole genome, and improve the level of refinement of the assembled genome sequence has become a problem that needs to be solved.
Summary of the invention
In view of the problems in the prior art, embodiments of the present invention provide a genome haploid assembly method and device.
In a first aspect, an embodiment of the present invention provides a genome haploid assembly method, the method comprising:
Obtaining SNP information of a reference genome;
Obtaining phasing information for the SNP information using PacBio data and Hi-C data respectively;
Obtaining integrated phasing information for the whole genome from the phasing information;
Distinguishing, according to the integrated phasing information, the haploid origin of the sequencing reads of the PacBio data, and performing haploid genome assembly for each haplotype separately.
In a second aspect, an embodiment of the present invention provides a genome haploid assembly device, the device comprising:
A first acquisition unit, configured to obtain SNP information of a reference genome;
A second acquisition unit, configured to obtain phasing information for the SNP information using PacBio data and Hi-C data respectively;
A third acquisition unit, configured to obtain integrated phasing information for the whole genome from the phasing information;
An assembly unit, configured to distinguish, according to the integrated phasing information, the haploid origin of the sequencing reads of the PacBio data, and to perform haploid genome assembly for each haplotype separately.
In a third aspect, an embodiment of the present invention provides an electronic device comprising a processor, a memory, and a bus, wherein:
The processor and the memory communicate with each other through the bus;
The memory stores program instructions executable by the processor, and by calling the program instructions the processor is able to execute the following method:
Obtaining SNP information of a reference genome;
Obtaining phasing information for the SNP information using PacBio data and Hi-C data respectively;
Obtaining integrated phasing information for the whole genome from the phasing information;
Distinguishing, according to the integrated phasing information, the haploid origin of the sequencing reads of the PacBio data, and performing haploid genome assembly for each haplotype separately.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, wherein:
The non-transitory computer-readable storage medium stores computer instructions that cause a computer to execute the following method:
Obtaining SNP information of a reference genome;
Obtaining phasing information for the SNP information using PacBio data and Hi-C data respectively;
Obtaining integrated phasing information for the whole genome from the phasing information;
Distinguishing, according to the integrated phasing information, the haploid origin of the sequencing reads of the PacBio data, and performing haploid genome assembly for each haplotype separately.
The genome haploid assembly method and device provided by the embodiments of the present invention use the integrated phasing information to distinguish the haploid origin of the sequencing reads of the PacBio data and assemble each haplotype separately, thereby achieving haploid assembly of the whole genome and improving the level of refinement of the assembled genome sequence.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the genome haploid assembly method according to an embodiment of the present invention;
Fig. 2 shows the first phasing information, obtained from PacBio data, according to an embodiment of the present invention;
Fig. 3 shows the second phasing information, obtained from Hi-C data, according to an embodiment of the present invention;
Fig. 4 shows the chromosome-level haplotype result for the whole genome according to an embodiment of the present invention;
Fig. 5 shows the two sets of haploid reads obtained after distinguishing haplotypes according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the genome haploid assembly device according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the physical structure of the electronic device provided by an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a schematic flowchart of the genome haploid assembly method according to an embodiment of the present invention. As shown in Fig. 1, the genome haploid assembly method provided by the embodiment comprises the following steps:
S101: Obtain SNP information of a reference genome.
Specifically, the device obtains the SNP information of the reference genome. The device can be understood as the equipment that executes this method. This may be done as follows: align high-throughput sequencing data to the reference genome; then perform SNP calling against the reference genome based on the alignment result to obtain the SNP information. Further, bwa may be used to align the reads to the reference genome, and samtools may be used to obtain the SNP information of the reference genome.
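The alignment and SNP calling of step S101 can be sketched as a short command pipeline. The patent names bwa and samtools but gives no command lines, so everything below is an assumption for illustration: the file names, the thread count, and the use of bcftools (the variant caller distributed alongside samtools) for the calling step. The function only builds the commands and does not run them; the resulting calls would still need to be restricted to SNPs.

```python
import shlex


def snp_calling_commands(reference, reads_1, reads_2, prefix, threads=4):
    """Return shell commands for alignment and variant calling (not executed here).

    All file names are hypothetical; the patent does not specify them.
    """
    q = shlex.quote
    bam = f"{prefix}.sorted.bam"
    vcf = f"{prefix}.snps.vcf.gz"
    return [
        # Build the bwa index for the reference genome.
        f"bwa index {q(reference)}",
        # Align paired-end reads and sort the alignments by coordinate.
        f"bwa mem -t {threads} {q(reference)} {q(reads_1)} {q(reads_2)}"
        f" | samtools sort -o {q(bam)} -",
        # Index the sorted BAM file.
        f"samtools index {q(bam)}",
        # Pile up reads against the reference and call variant sites.
        f"bcftools mpileup -f {q(reference)} {q(bam)}"
        f" | bcftools call -mv -Oz -o {q(vcf)}",
    ]


for command in snp_calling_commands("ref.fa", "r1.fq.gz", "r2.fq.gz", "sample"):
    print(command)
```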
S102: Obtain phasing information for the SNP information using PacBio data and Hi-C data respectively.
Specifically, the device obtains phasing information for the SNP information using PacBio data and Hi-C data respectively. This may be done as follows: align the PacBio data to the reference genome; then phase the SNP information using the first alignment information, obtained from the PacBio data, to obtain the first phasing information of the PacBio data. Further, blasr may be used to align the PacBio data to the reference genome, and WhatsHap may be used with the first alignment information to perform SNP phasing on the reference genome.
Align the Hi-C data to the reference genome; then phase the SNP information using the second alignment information, obtained from the Hi-C data, to obtain the second phasing information of the Hi-C data. Further, bwa may be used to align the Hi-C data to the reference genome, and HapCut2 may be used with the second alignment information to perform SNP phasing on the reference genome. Fig. 2 shows the first phasing information of the PacBio data, and Fig. 3 shows the second phasing information of the Hi-C data. As shown in Figs. 2 and 3, the phased SNPs obtained from PacBio data cover the genome fairly continuously, but the haplotype blocks are small; the phased SNPs obtained from Hi-C data span a large distance, but the phased SNPs in between are sparse.
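The contrast between the two phasing results can be made concrete by comparing the span and SNP density of a phase block. The sketch below is only an illustration: the `PhaseBlock` class, the chromosome name, and the SNP positions are hypothetical toy values, not data from the patent.

```python
from dataclasses import dataclass


@dataclass
class PhaseBlock:
    """A set of SNPs on one chromosome that have been phased together."""
    chrom: str
    positions: list  # sorted SNP positions in the block

    @property
    def span(self):
        # Distance covered from the first to the last phased SNP, inclusive.
        return self.positions[-1] - self.positions[0] + 1

    @property
    def snps_per_kb(self):
        # Density of phased SNPs within the block.
        return 1000 * len(self.positions) / self.span


# Toy blocks: a short, dense PacBio block and a long, sparse Hi-C block.
pacbio_block = PhaseBlock("chr1", [100, 350, 800, 1200, 1900])
hic_block = PhaseBlock("chr1", [100, 250_000, 900_000])

for name, block in [("PacBio", pacbio_block), ("Hi-C", hic_block)]:
    print(name, block.span, round(block.snps_per_kb, 3))
```

In this toy case the PacBio block is dense but short and the Hi-C block is long but sparse, which is the complementarity that the integration step S103 relies on.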
S103: Obtain integrated phasing information for the whole genome from the phasing information.
Specifically, the device obtains integrated phasing information for the whole genome from the phasing information. This may be done as follows: use the second phasing information of the Hi-C data to join the first phasing information of the PacBio data; and use the second phasing information to correct errors in the first phasing information, to obtain the integrated phasing information. Fig. 4 shows the chromosome-level haplotype result for the whole genome. As shown in Fig. 4, each chromosome has one haplotype block that spans the whole chromosome and contains most of the SNPs on that chromosome. For example, the block from 196 to 47,588,925 on Lachesis_group0 contains 252,046 variant sites, and the block from 1,784 to 41,365,984 on Lachesis_group1 contains 178,242 variant sites. The remaining blocks are all small blocks made up of a few individual sites.
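The patent does not spell out how the joining and error correction are carried out, so the following is a minimal sketch under stated assumptions rather than the patented procedure. It assumes each PacBio block is a mapping from SNP position to the allele (0 or 1) placed on the block's first haplotype, and that the Hi-C phasing is a sparse chromosome-wide mapping of the same kind. Each block is oriented by majority vote over the SNPs it shares with the Hi-C phasing, and Hi-C assignments override conflicting PacBio assignments. The function name and the toy data are hypothetical.

```python
def integrate_phasing(pacbio_blocks, hic_phase):
    """Join PacBio phase blocks into one chromosome-wide phasing using Hi-C anchors."""
    merged = {}
    for block in pacbio_blocks:
        shared = [pos for pos in block if pos in hic_phase]
        agree = sum(block[pos] == hic_phase[pos] for pos in shared)
        disagree = len(shared) - agree
        if agree == disagree:
            # No anchor SNPs, or a tie: the block orientation is undetermined.
            continue
        flip = disagree > agree
        for pos, allele in block.items():
            # Flipping swaps the two haplotypes of the whole block.
            merged[pos] = 1 - allele if flip else allele
    # Error correction (assumed rule): sites phased by Hi-C keep the Hi-C call.
    merged.update(hic_phase)
    return merged


blocks = [
    {100: 0, 200: 1, 300: 0, 400: 1},  # same orientation as Hi-C, one conflict at 400
    {500: 1, 600: 0, 700: 1},          # opposite orientation, so it gets flipped
]
hic = {100: 0, 300: 0, 400: 0, 500: 0, 700: 0, 900: 1}

print(integrate_phasing(blocks, hic))
```

Blocks that share no SNP with the Hi-C phasing stay unplaced in this sketch, which is consistent with the small leftover blocks described above, although the patent does not state how they arise.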
S104: According to the integrated phasing information, distinguish the haploid origin of the sequencing reads of the PacBio data, and perform haploid genome assembly for each haplotype separately.
Specifically, the device distinguishes the haploid origin of the sequencing reads of the PacBio data according to the integrated phasing information, and performs haploid genome assembly for each haplotype separately. This may be done as follows: according to the integrated phasing information, determine the haplotype to which each sequencing read of the PacBio data belongs. Further, obtain all SNPs in each read, where each SNP corresponds to a designated haplotype; count the SNPs corresponding to each designated haplotype; and take as the read's haplotype the designated haplotype whose SNP count, as a proportion of the total number of SNPs in the read, exceeds a preset ratio. The preset ratio can be set according to the actual situation and may optionally be 0.7. For example, suppose a read contains four SNPs, SNPA, SNPB, SNPC, and SNPD, where the designated haplotype of SNPA, SNPB, and SNPC is type 1 and that of SNPD is type 2. The SNP count for type 1 is 3 and for type 2 is 1, so the proportion for type 1 is 0.75, which is greater than 0.7. Type 1 is therefore taken as the haplotype to which this read belongs.
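The ratio rule above can be sketched in a few lines. The function name `assign_haplotype` and the choice to return `None` for a read that does not pass the threshold are assumptions made for illustration; the 0.7 default and the worked example come from the text.

```python
from collections import Counter


def assign_haplotype(snp_haplotypes, min_ratio=0.7):
    """Return the haplotype supported by more than min_ratio of a read's SNPs.

    snp_haplotypes lists the designated haplotype of every SNP on the read.
    Returns None when the read has no SNPs or no haplotype passes the threshold.
    """
    if not snp_haplotypes:
        return None
    haplotype, count = Counter(snp_haplotypes).most_common(1)[0]
    return haplotype if count / len(snp_haplotypes) > min_ratio else None


# SNPA, SNPB, SNPC are type 1 and SNPD is type 2: 3/4 = 0.75 > 0.7.
print(assign_haplotype([1, 1, 1, 2]))  # prints 1
# An even split gives 0.5, which does not exceed 0.7.
print(assign_haplotype([1, 2, 1, 2]))  # prints None
```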
Performing haploid genome assembly separately may specifically include: according to the haplotype to which each sequencing read of the PacBio data belongs, performing haploid genome assembly separately on all the sequencing reads of the PacBio data. Continuing the example above, this read is assigned to the type 1 haplotype; all reads of the PacBio data are traversed in this way, so that every read is assigned to its corresponding haplotype. Fig. 5 shows the two sets of haploid reads obtained after distinguishing haplotypes. As shown in Fig. 5, the SNPs on each read essentially come from the same haplotype, with errors at a few individual sites; a ratio of 0.7 is sufficient to determine the haploid origin of most reads.
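Traversing all reads and sorting them into per-haplotype sets might look like the sketch below. The read names and SNP lists are hypothetical, and collecting reads that fail the threshold into an "unassigned" bin is an assumption: the patent does not say how such reads are handled.

```python
from collections import Counter, defaultdict


def partition_reads(read_snps, min_ratio=0.7):
    """Group read names by the haplotype supplying more than min_ratio of their SNPs."""
    bins = defaultdict(list)
    for read_name, haplotypes in read_snps.items():
        label = "unassigned"
        if haplotypes:
            haplotype, count = Counter(haplotypes).most_common(1)[0]
            if count / len(haplotypes) > min_ratio:
                label = haplotype
        bins[label].append(read_name)
    return dict(bins)


# Designated haplotype of each SNP on four hypothetical reads.
reads = {
    "read_a": [1, 1, 1, 2],     # 0.75 type 1
    "read_b": [2, 2, 2, 2, 2],  # 1.0 type 2
    "read_c": [1, 2],           # 0.5, ambiguous
    "read_d": [],               # no SNPs on the read
}
bins = partition_reads(reads)
print(bins)
```

Each haplotype bin would then be passed to the assembler as its own read set.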
The two sets of reads were assembled separately, chromosome by chromosome, using canu. The assembly results for the two haplotypes are shown in Table 1.
Table 1
Type | CtgNum | CtgLen | N50 | N90 | CtgMax | GC (%) |
Hap1 | 2,020 | 367,352,803 | 379,142 | 73,992 | 3,086,929 | 43.43 |
Hap2 | 2,028 | 372,970,663 | 375,986 | 76,163 | 2,746,366 | 43.40 |
The results show that the two haploid genomes differ little in size and continuity, indicating that the haplotype separation is not biased.
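The N50 and N90 columns in Table 1 are standard contiguity statistics. As a reminder of how they are computed, here is a minimal sketch using a small made-up list of contig lengths, not the patent's data; the function name `nx` is hypothetical.

```python
def nx(lengths, fraction):
    """Return the contig length L such that contigs of length >= L
    together cover at least `fraction` of the total assembly length."""
    threshold = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    raise ValueError("lengths must be non-empty")


contigs = [2, 2, 2, 3, 3, 4, 8, 8]  # toy contig lengths, total 32
print(nx(contigs, 0.5), nx(contigs, 0.9))  # N50 and N90
```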
To assess the accuracy of the results, second-generation sequencing was performed on the two parents of the assembled individual, and SNP calling was performed against the reference genome to obtain the SNPs of each parent. The phased SNPs were compared with the parental SNPs to obtain the accuracy of the phased SNPs. In addition, the assembled haploid genomes were aligned to the reference genome to obtain the SNPs between each haploid genome and the reference genome; these SNPs were then checked for consistency with the parental SNPs to obtain the accuracy of the haploid genomes. The results are shown in Table 2.
Table 2
The switch error is calculated as follows: if, of two adjacent SNPs, the first comes from parent a and the second from parent b, this is counted as one error; the switch error is the proportion of such erroneous SNPs among all SNPs. Hap ratio is the proportion of SNPs that come from a single parent among all SNPs. "SNP" denotes the accuracy of the SNPs after phasing; "Contig" denotes the accuracy of the assembled haploid genome sequence. Table 2 shows that the haploid assemblies produced by this method are highly accurate.
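The two metrics defined above can be sketched as follows. The input is assumed to be the parental origin ("a" or "b") of each SNP along one haplotype, in genomic order; the toy sequence and both function names are hypothetical. Following the text, the switch count is divided by the total number of SNPs, and a change of origin in either direction is counted; other definitions divide by the number of adjacent pairs instead.

```python
def switch_error(origins):
    """Fraction of SNPs whose parental origin differs from the preceding SNP."""
    if not origins:
        raise ValueError("origins must be non-empty")
    switches = sum(prev != cur for prev, cur in zip(origins, origins[1:]))
    return switches / len(origins)


def hap_ratio(origins, parent):
    """Fraction of SNPs on the haplotype that come from the given parent."""
    if not origins:
        raise ValueError("origins must be non-empty")
    return origins.count(parent) / len(origins)


# Ten SNPs on one assembled haplotype; a single site is attributed to parent b.
origins = list("aaaabaaaaa")
print(switch_error(origins), hap_ratio(origins, "a"))
```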
The genome haploid assembly method provided by the embodiment of the present invention uses the integrated phasing information to distinguish the haploid origin of the sequencing reads of the PacBio data and assembles each haplotype separately, thereby achieving haploid assembly of the whole genome and improving the level of refinement of the assembled genome sequence.
On the basis of the above embodiment, obtaining the SNP information of the reference genome comprises:
Aligning high-throughput sequencing data to the reference genome.
Specifically, the device aligns high-throughput sequencing data to the reference genome. See the above embodiment; details are not repeated here.
Performing SNP calling against the reference genome based on the alignment result, to obtain the SNP information.
Specifically, the device performs SNP calling against the reference genome based on the alignment result, to obtain the SNP information. See the above embodiment; details are not repeated here.
The genome haploid assembly method provided by this embodiment can obtain the SNP information effectively, which further enables haploid assembly of the whole genome and improves the level of refinement of the assembled genome sequence.
On the basis of the above embodiment, obtaining phasing information for the SNP information using PacBio data and Hi-C data respectively comprises:
Aligning the PacBio data to the reference genome.
Specifically, the device aligns the PacBio data to the reference genome. See the above embodiment; details are not repeated here.
Phasing the SNP information using the first alignment information of the PacBio data, to obtain the first phasing information of the PacBio data.
Specifically, the device phases the SNP information using the first alignment information of the PacBio data, to obtain the first phasing information of the PacBio data. See the above embodiment; details are not repeated here.
Aligning the Hi-C data to the reference genome.
Specifically, the device aligns the Hi-C data to the reference genome. See the above embodiment; details are not repeated here.
Phasing the SNP information using the second alignment information of the Hi-C data, to obtain the second phasing information of the Hi-C data.
Specifically, the device phases the SNP information using the second alignment information of the Hi-C data, to obtain the second phasing information of the Hi-C data. See the above embodiment; details are not repeated here.
The genome haploid assembly method provided by this embodiment can effectively obtain the phasing information for the SNP information, which further enables haploid assembly of the whole genome and improves the level of refinement of the assembled genome sequence.
On the basis of the above embodiment, obtaining the integrated phasing information for the whole genome from the phasing information comprises:
Joining the first phasing information of the PacBio data using the second phasing information of the Hi-C data.
Specifically, the device joins the first phasing information of the PacBio data using the second phasing information of the Hi-C data. See the above embodiment; details are not repeated here.
Correcting errors in the first phasing information using the second phasing information, to obtain the integrated phasing information.
Specifically, the device corrects errors in the first phasing information using the second phasing information, to obtain the integrated phasing information. See the above embodiment; details are not repeated here.
The genome haploid assembly method provided by this embodiment can effectively obtain the integrated phasing information for the whole genome, which further enables haploid assembly of the whole genome and improves the level of refinement of the assembled genome sequence.
On the basis of the above embodiment, distinguishing the haploid origin of the sequencing reads of the PacBio data according to the integrated phasing information comprises:
Determining, according to the integrated phasing information, the haplotype to which each sequencing read of the PacBio data belongs.
Specifically, the device determines, according to the integrated phasing information, the haplotype to which each sequencing read of the PacBio data belongs. See the above embodiment; details are not repeated here.
By determining the haplotype to which each sequencing read of the PacBio data belongs, the genome haploid assembly method provided by this embodiment can effectively distinguish the haploid origin of the sequencing reads of the PacBio data.
On the basis of the above embodiment, determining, according to the integrated phasing information, the haplotype to which each sequencing read of the PacBio data belongs comprises:
Obtaining all SNPs in each sequencing read, where each SNP corresponds to a designated haplotype.
Specifically, the device obtains all SNPs in each sequencing read, where each SNP corresponds to a designated haplotype. See the above embodiment; details are not repeated here.
Counting the SNPs corresponding to each designated haplotype, and taking as the read's haplotype the designated haplotype whose SNP count, as a proportion of the total number of SNPs, exceeds a preset ratio.
Specifically, the device counts the SNPs corresponding to each designated haplotype and takes as the read's haplotype the designated haplotype whose SNP count, as a proportion of the total number of SNPs, exceeds the preset ratio. See the above embodiment; details are not repeated here.
The genome haploid assembly method provided by this embodiment can further determine, accurately and reasonably, the haplotype to which each sequencing read of the PacBio data belongs.
On the basis of the above embodiment, performing haploid genome assembly for each haplotype separately comprises:
Performing haploid genome assembly separately on all the sequencing reads of the PacBio data, according to the haplotype to which each sequencing read of the PacBio data belongs.
Specifically, the device performs haploid genome assembly separately on all the sequencing reads of the PacBio data, according to the haplotype to which each sequencing read belongs. See the above embodiment; details are not repeated here.
By performing haploid genome assembly separately on all the sequencing reads of the PacBio data, the genome haploid assembly method provided by this embodiment further enables haploid assembly of the whole genome and improves the level of refinement of the assembled genome sequence.
Fig. 6 is a schematic structural diagram of the genome haploid assembly device according to an embodiment of the present invention. As shown in Fig. 6, an embodiment of the present invention provides a genome haploid assembly device comprising a first acquisition unit 601, a second acquisition unit 602, a third acquisition unit 603, and an assembly unit 604, wherein:
The first acquisition unit 601 is configured to obtain SNP information of a reference genome; the second acquisition unit 602 is configured to obtain phasing information for the SNP information using PacBio data and Hi-C data respectively; the third acquisition unit 603 is configured to obtain integrated phasing information for the whole genome from the phasing information; and the assembly unit 604 is configured to distinguish, according to the integrated phasing information, the haploid origin of the sequencing reads of the PacBio data, and to perform haploid genome assembly for each haplotype separately.
The genome haploid assembly device provided by the embodiment of the present invention uses the integrated phasing information to distinguish the haploid origin of the sequencing reads of the PacBio data and assembles each haplotype separately, thereby achieving haploid assembly of the whole genome and improving the level of refinement of the assembled genome sequence.
The genome haploid assembly device provided by the embodiment of the present invention can be used to execute the processing flows of the above method embodiments. Its functions are not described again here; see the detailed description of the above method embodiments.
Fig. 7 is a schematic diagram of the physical structure of the electronic device provided by an embodiment of the present invention. As shown in Fig. 7, the electronic device comprises a processor 701, a memory 702, and a bus 703;
The processor 701 and the memory 702 communicate with each other through the bus 703;
The processor 701 is configured to call the program instructions in the memory 702 to execute the methods provided by the above method embodiments, for example: obtaining SNP information of a reference genome; obtaining phasing information for the SNP information using PacBio data and Hi-C data respectively; obtaining integrated phasing information for the whole genome from the phasing information; and distinguishing, according to the integrated phasing information, the haploid origin of the sequencing reads of the PacBio data, and performing haploid genome assembly for each haplotype separately.
This embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium. The computer program comprises program instructions which, when executed by a computer, enable the computer to execute the methods provided by the above method embodiments, for example: obtaining SNP information of a reference genome; obtaining phasing information for the SNP information using PacBio data and Hi-C data respectively; obtaining integrated phasing information for the whole genome from the phasing information; and distinguishing, according to the integrated phasing information, the haploid origin of the sequencing reads of the PacBio data, and performing haploid genome assembly for each haplotype separately.
This embodiment provides a non-transitory computer-readable storage medium that stores computer instructions, the computer instructions causing a computer to execute the methods provided by the above method embodiments, for example: obtaining SNP information of a reference genome; obtaining phasing information for the SNP information using PacBio data and Hi-C data respectively; obtaining integrated phasing information for the whole genome from the phasing information; and distinguishing, according to the integrated phasing information, the haploid origin of the sequencing reads of the PacBio data, and performing haploid genome assembly for each haplotype separately.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
The embodiments described above, such as the electronic device, are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solution, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the embodiments of the present invention, not to limit them. Although the embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A genome haploid assembly method, characterized by comprising:
obtaining SNP information of a reference genome;
obtaining phasing information of the SNP information using PacBio data and Hi-C data, respectively;
obtaining integrated phasing information of the whole genome according to the phasing information;
distinguishing the haploid source of the sequencing fragments of the PacBio data according to the integrated phasing information, and performing genome haploid assembly for each source separately.
2. The method according to claim 1, characterized in that obtaining the SNP information of the reference genome comprises:
aligning high-throughput sequencing data to the reference genome;
performing SNP calling against the reference genome based on the alignment result, to obtain the SNP information.
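As a rough illustration of the SNP calling step in claim 2, the sketch below picks heterozygous SNP candidates from per-position base counts. It assumes the alignment has already been reduced to a pileup; the function name `call_het_snps`, the thresholds, and the toy data are hypothetical and are not taken from the patent. A real pipeline would rely on an established aligner and variant caller.

```python
from collections import Counter

def call_het_snps(reference, pileup, min_depth=8, min_minor_frac=0.25):
    """Return {position: (ref_base, alt_base)} for heterozygous SNP candidates.

    reference: reference sequence as a string.
    pileup: dict mapping 0-based position -> list of aligned read bases.
    min_depth and min_minor_frac are illustrative thresholds (assumptions).
    """
    snps = {}
    for pos, bases in sorted(pileup.items()):
        if len(bases) < min_depth:
            continue  # too little coverage to make a call
        counts = Counter(bases)
        ref_base = reference[pos]
        # Most frequent non-reference base at this position, if any.
        alts = [(n, b) for b, n in counts.items() if b != ref_base]
        if not alts:
            continue
        alt_count, alt_base = max(alts)
        ref_count = counts.get(ref_base, 0)
        # Heterozygous site: both alleles are well supported.
        if min(ref_count, alt_count) / len(bases) >= min_minor_frac:
            snps[pos] = (ref_base, alt_base)
    return snps

reference = "ACGTACGT"
pileup = {
    2: list("GGGGGTTTTT"),  # G/T site supported by half of the reads each
    5: list("CCCCCCCCCA"),  # a single mismatch, likely a sequencing error
    6: list("GGG"),         # coverage too low
}
snps = call_het_snps(reference, pileup)
print(snps)
```

Only heterozygous sites are kept here because they are the ones that can later distinguish the two haplotypes.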
3. The method according to claim 1, characterized in that obtaining the phasing information of the SNP information using PacBio data and Hi-C data, respectively, comprises:
aligning the PacBio data to the reference genome;
phasing the SNP information using first alignment information of the PacBio data, to obtain first phasing information of the PacBio data;
aligning the Hi-C data to the reference genome;
phasing the SNP information using second alignment information of the Hi-C data, to obtain second phasing information of the Hi-C data.
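To make the phasing step in claim 3 more concrete, here is a minimal sketch of read-based phasing: consecutive SNPs are linked when enough reads cover both, and a new phase block starts wherever the link is missing or ambiguous. The function `phase_snps`, the `min_links` parameter, and the input representation are assumptions for illustration; the patent does not specify the phasing algorithm, and dedicated phasing tools use far more robust models.

```python
def phase_snps(positions, reads, min_links=2):
    """Greedy read-based phasing of heterozygous SNPs.

    positions: sorted SNP positions.
    reads: list of dicts {position: allele}, allele 0 (ref) or 1 (alt),
        as observed on a single read.
    Returns {position: (block_id, allele_on_haplotype_1)}.
    """
    phase = {}
    block = 0
    prev = None
    for pos in positions:
        if prev is None:
            phase[pos] = (block, 0)
        else:
            # Reads spanning both SNPs vote for "same" or "different" allele.
            spanning = [r for r in reads if prev in r and pos in r]
            same = sum(1 for r in spanning if r[prev] == r[pos])
            diff = len(spanning) - same
            if len(spanning) < min_links or same == diff:
                block += 1  # no reliable link: start a new phase block
                phase[pos] = (block, 0)
            else:
                prev_allele = phase[prev][1]
                allele = prev_allele if same > diff else 1 - prev_allele
                phase[pos] = (block, allele)
        prev = pos
    return phase

positions = [100, 250, 900, 1200]
reads = [
    {100: 0, 250: 1},
    {100: 1, 250: 0},
    {100: 0, 250: 1},
    {900: 1, 1200: 1},
    {900: 0, 1200: 0},
    {250: 0, 900: 0},  # a single spanning read, below min_links
]
phase = phase_snps(positions, reads)
print(phase)
```

In this toy input the SNPs fall into two separate blocks, because only one read spans positions 250 and 900. Gaps of this kind are what longer-range evidence such as Hi-C links can bridge.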
4. The method according to claim 1, characterized in that obtaining the integrated phasing information of the whole genome according to the phasing information comprises:
connecting the first phasing information of the PacBio data by means of the second phasing information of the Hi-C data;
correcting errors in the first phasing information by means of the second phasing information, to obtain the integrated phasing information.
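One way to picture the two operations in claim 4 is a vote over Hi-C links: first each PacBio phase block is oriented relative to the blocks already joined, then any single SNP that the Hi-C links contradict is flipped. The sketch below follows that idea. The function `integrate_phasing`, the majority-vote rule, and the data layout are assumptions made for illustration and are not the procedure disclosed in the patent.

```python
def integrate_phasing(pacbio_phase, hic_pairs):
    """Join PacBio phase blocks with Hi-C links, then fix contradicted SNPs.

    pacbio_phase: {position: (block_id, allele_on_haplotype_1)}
    hic_pairs: list of ((pos_a, allele_a), (pos_b, allele_b)); both ends of
        a Hi-C pair are assumed to come from the same haplotype.
    Returns {position: allele_on_haplotype_1} in one genome-wide frame.
    """
    def votes(merged, candidate):
        # Count Hi-C pairs that agree or disagree with keeping `candidate`
        # in the same haplotype-1 frame as `merged`.
        agree = disagree = 0
        for first, second in hic_pairs:
            for (p1, a1), (p2, a2) in ((first, second), (second, first)):
                if p1 in merged and p2 in candidate and p1 != p2:
                    if (a1 == merged[p1]) == (a2 == candidate[p2]):
                        agree += 1
                    else:
                        disagree += 1
        return agree, disagree

    blocks = {}
    for pos, (block_id, allele) in pacbio_phase.items():
        blocks.setdefault(block_id, {})[pos] = allele

    # Step 1 (connect): orient each block against what is already merged.
    merged = {}
    for block_id in sorted(blocks):
        block = blocks[block_id]
        agree, disagree = votes(merged, block)
        flip = disagree > agree
        for pos, allele in block.items():
            merged[pos] = 1 - allele if flip else allele

    # Step 2 (correct): flip single SNPs that Hi-C links mostly contradict.
    corrected = dict(merged)
    for pos in merged:
        agree, disagree = votes(merged, {pos: merged[pos]})
        if disagree > agree:
            corrected[pos] = 1 - merged[pos]
    return corrected

pacbio_phase = {100: (0, 0), 250: (0, 1), 400: (0, 0), 900: (1, 0), 1200: (1, 0)}
hic_pairs = [
    ((100, 0), (900, 1)),
    ((250, 1), (1200, 1)),
    ((100, 1), (1200, 0)),
    ((400, 1), (900, 1)),
    ((100, 0), (400, 1)),
]
integrated = integrate_phasing(pacbio_phase, hic_pairs)
print(integrated)
```

In the toy data the second block is flipped to match the first, and the SNP at position 400 is corrected because both Hi-C pairs touching it disagree with its PacBio phase.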
5. The method according to any one of claims 1 to 4, characterized in that distinguishing the haploid source of the sequencing fragments of the PacBio data according to the integrated phasing information comprises:
determining, according to the integrated phasing information, the haplotype to which each sequencing fragment of the PacBio data belongs.
6. The method according to claim 5, characterized in that determining, according to the integrated phasing information, the haplotype to which each sequencing fragment of the PacBio data belongs comprises:
obtaining all SNPs in each sequencing fragment, each SNP corresponding to a specified haplotype;
obtaining the number of SNPs corresponding to each specified haplotype, and taking as the haplotype to which the fragment belongs the specified haplotype whose SNP count, as a ratio of the total number of SNPs, is greater than a preset ratio.
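The ratio rule of claim 6 can be sketched in a few lines. The function name `assign_haplotype`, the labels `"H1"` and `"H2"`, and the default `preset_ratio` of 0.8 are hypothetical; the claim only requires that the winning haplotype's share of the SNPs in the fragment exceed a preset ratio. The sketch assumes biallelic SNPs and a ratio of at least 0.5, so that at most one haplotype can qualify.

```python
from collections import Counter

def assign_haplotype(read_alleles, integrated_phase, preset_ratio=0.8):
    """Assign one sequencing fragment to a haplotype, or None if undecided.

    read_alleles: {position: allele observed on the fragment}
    integrated_phase: {position: allele_on_haplotype_1}
    preset_ratio: illustrative threshold (assumption, not from the patent)
    """
    counts = Counter()
    for pos, allele in read_alleles.items():
        if pos not in integrated_phase:
            continue  # SNP without phasing information is ignored
        hap = "H1" if allele == integrated_phase[pos] else "H2"
        counts[hap] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    for hap, n in counts.items():
        if n / total > preset_ratio:
            return hap
    return None

integrated_phase = {100: 0, 250: 1, 400: 1, 900: 1, 1200: 1}
read_a = {100: 0, 250: 1, 400: 1}  # all three SNPs match haplotype 1
read_b = {900: 0, 1200: 0}         # both SNPs match haplotype 2
read_c = {100: 0, 250: 0}          # one SNP each way: no haplotype dominates
print(assign_haplotype(read_a, integrated_phase))
print(assign_haplotype(read_b, integrated_phase))
print(assign_haplotype(read_c, integrated_phase))
```

Returning `None` for undecided fragments is a design choice of this sketch; how such fragments are handled downstream is not stated in the claim.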
7. The method according to claim 6, characterized in that performing genome haploid assembly separately comprises:
performing genome haploid assembly separately on all sequencing fragments of the PacBio data, according to the haplotype to which each sequencing fragment of the PacBio data belongs.
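Before the separate assemblies in claim 7 can run, the fragments have to be split by haplotype. The sketch below shows one possible way to do that and to write each group as FASTA text for an assembler of choice. The names `partition_reads` and `to_fasta`, and the decision to keep undecided fragments in their own `"unassigned"` bin, are assumptions of this example; the assembler itself is not shown.

```python
from collections import defaultdict

def partition_reads(reads, assignments):
    """Group fragment sequences by assigned haplotype.

    reads: {read_id: sequence}
    assignments: {read_id: "H1", "H2", or None}
    Fragments with no assignment go into an "unassigned" bin (assumption).
    """
    bins = defaultdict(dict)
    for read_id, seq in reads.items():
        hap = assignments.get(read_id)
        bins[hap if hap is not None else "unassigned"][read_id] = seq
    return dict(bins)

def to_fasta(records):
    """Serialize {read_id: sequence} as FASTA text, sorted by read id."""
    return "".join(f">{rid}\n{seq}\n" for rid, seq in sorted(records.items()))

reads = {"r1": "ACGT", "r2": "TTGA", "r3": "GGCA", "r4": "CCAT"}
assignments = {"r1": "H1", "r2": "H2", "r3": "H1", "r4": None}
bins = partition_reads(reads, assignments)
for hap, records in sorted(bins.items()):
    print(hap)
    print(to_fasta(records), end="")
```

Each per-haplotype FASTA would then be assembled independently, giving one assembly per haplotype.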
8. A genome haploid assembly device, characterized by comprising:
a first acquisition unit, configured to obtain SNP information of a reference genome;
a second acquisition unit, configured to obtain phasing information of the SNP information using PacBio data and Hi-C data, respectively;
a third acquisition unit, configured to obtain integrated phasing information of the whole genome according to the phasing information;
an assembly unit, configured to distinguish the haploid source of the sequencing fragments of the PacBio data according to the integrated phasing information, and to perform genome haploid assembly for each source separately.
9. An electronic device, characterized by comprising: a processor, a memory, and a bus, wherein
the processor and the memory communicate with each other via the bus;
the memory stores program instructions executable by the processor, and the processor, by calling the program instructions, is able to execute the method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, the computer instructions causing the computer to execute the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811069322.6A CN109273052B (en) | 2018-09-13 | 2018-09-13 | Genome haploid assembling method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109273052A (en) | 2019-01-25 |
CN109273052B (en) | 2022-03-18 |
Family
ID=65188628
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811069322.6A Active CN109273052B (en) | 2018-09-13 | 2018-09-13 | Genome haploid assembling method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109273052B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107368705A (en) * | 2011-04-14 | 2017-11-21 | 考利达基因组股份有限公司 | The processing and analysis of complex nucleic acid sequence data |
CN102224801A (en) * | 2011-04-19 | 2011-10-26 | 江苏省农业科学院 | Rapid multi-target property polymerization breeding method for rape |
CN104736722A (en) * | 2012-05-21 | 2015-06-24 | 斯克利普斯研究所 | Methods of sample preparation |
CN108486236A (en) * | 2012-07-18 | 2018-09-04 | 伊鲁米纳剑桥有限公司 | Method and system for determining haplotype He determining phase haplotype |
US20160168632A1 (en) * | 2013-08-02 | 2016-06-16 | Stc Unm | Dna sequencing and epigenome analysis |
CN105637099A (en) * | 2013-08-23 | 2016-06-01 | 考利达基因组股份有限公司 | Long fragment de novo assembly using short reads |
CN107533590A (en) * | 2015-02-17 | 2018-01-02 | 多弗泰尔基因组学有限责任公司 | Nucleotide sequence assembles |
US20170235876A1 (en) * | 2016-02-11 | 2017-08-17 | 10X Genomics, Inc. | Systems, methods, and media for de novo assembly of whole genome sequence data |
CN107419000A (en) * | 2016-05-24 | 2017-12-01 | 中国农业科学院作物科学研究所 | A kind of full genome system of selection and its application that prediction Soybean Agronomic Characters phenotype is sampled based on haplotype |
WO2018035070A2 (en) * | 2016-08-16 | 2018-02-22 | Monsanto Technology Llc | Compositions and methods for plant haploid induction |
Non-Patent Citations (4)
Title |
---|
DEREK M. BICKHART et al.: "Single-molecule sequencing and conformational capture enable de novo mammalian reference genomes", BIORXIV * |
PETER EDGE et al.: "HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies", GENOME RESEARCH * |
S. A. LAPP et al.: "PacBio assembly of a Plasmodium knowlesi genome sequence with Hi-C correction and manual annotation of the SICAvar gene family", SPECIAL ISSUE ARTICLE * |
ZEV N. KRONENBERG et al.: "FALCON-Phase: Integrating PacBio and Hi-C data for phased diploid genomes", BIORXIV * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111816248A (en) * | 2020-05-22 | 2020-10-23 | 武汉菲沙基因信息有限公司 | Complete genome typing method based on Pacbio libraries and Hi-C reads |
CN111816248B (en) * | 2020-05-22 | 2023-12-01 | 武汉菲沙基因信息有限公司 | Pacbio surassemblies and Hi-C reads-based whole genome typing method |
CN112908415A (en) * | 2021-02-23 | 2021-06-04 | 广西壮族自治区农业科学院 | Method for obtaining more accurate chromosome level genome |
CN112908415B (en) * | 2021-02-23 | 2022-05-17 | 广西壮族自治区农业科学院 | Method for obtaining chromosome level genome |
CN113151426A (en) * | 2021-04-16 | 2021-07-23 | 中国农业科学院兰州畜牧与兽药研究所 | Method for assembling and annotating Hobara sheep genome based on three-generation PacBio and Hi-C technology |
Also Published As
Publication number | Publication date |
---|---|
CN109273052B (en) | 2022-03-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||