CN109273052A - Genome haploid assembly method and device - Google Patents

Genome haploid assembly method and device


Publication number
CN109273052A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811069322.6A
Other languages
Chinese (zh)
Other versions
CN109273052B (en
Inventor
郑洪坤
邓德晶
刘敏
李绪明
黎松
刘福
刘东源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING BIOMARKER TECHNOLOGIES Co Ltd
Original Assignee
BEIJING BIOMARKER TECHNOLOGIES Co Ltd
Application filed by BEIJING BIOMARKER TECHNOLOGIES Co Ltd
Priority to CN201811069322.6A
Publication of CN109273052A
Application granted
Publication of CN109273052B
Active


Abstract

An embodiment of the present invention provides a genome haploid assembly method and device. The method comprises: obtaining SNP information of a reference genome; obtaining phasing information for the SNP information using PacBio data and Hi-C data respectively; obtaining integrated phasing information for the whole genome according to the phasing information; and distinguishing the haploid origin of the sequencing reads of the PacBio data according to the integrated phasing information, then performing haploid genome assembly separately. The device executes the above method. By using the integrated phasing information to distinguish the haploid origin of the PacBio sequencing reads and assembling each haplotype separately, the method and device achieve haploid assembly of the whole genome and improve the level of detail of the assembled genome sequence.

Description

Genome haploid assembly method and device
Technical field
Embodiments of the present invention relate to the field of genetic engineering, and in particular to a genome haploid assembly method and device.
Background
With the continuous development of sequencing technology, genome assembly techniques have also steadily improved. From second-generation to third-generation sequencing, as sequencing reads have grown longer, the results of genome assembly have improved as well. There are currently three main genome assembly algorithms: de Bruijn graph (DBG), overlap-layout-consensus (OLC), and string graph. For a diploid genome, however, every one of these methods assembles only a single genome of haploid size and does not distinguish between homologous chromosomes. This is mainly a limitation of current technology: homologous chromosomes are highly similar, and sequencing reads are not long enough to span the identical segments between them, so the phase relationship between the differences on either side of such a segment cannot be determined.
Because of this, many software tools now perform phasing on the SNPs (single nucleotide polymorphisms) between homologous chromosomes to determine their phase relationship, such as SNPHap, SHAPEIT, WhatsHap, and HapCut. Of these, WhatsHap can phase SNPs using third-generation PacBio data, which works much better than second-generation data: the haplotype blocks obtained by phasing are markedly longer and fewer in number. Even so, reads still cannot span certain homozygous regions. HapCut can phase using Hi-C data and distinguish haplotypes at the chromosome level, but, limited by the principle of Hi-C sequencing, the phased SNPs are sparsely distributed across the genome. Falcon, an assembler for third-generation PacBio data, has a downstream analysis tool, Falcon-unzip, which uses variant information such as SNPs and SVs found during assembly to re-assemble haplotypes in the variant regions of the assembly, and it works well for the genomes of some species. Likewise, however, it is limited by read length: as with WhatsHap, it assembles haploid segments piece by piece, and the phase relationship between segments still cannot be determined.
Therefore, how to avoid the above drawbacks, achieve haploid assembly of the whole genome, and improve the level of detail of the assembled genome sequence has become a problem that needs to be solved.
Summary of the invention
In view of the problems in the prior art, embodiments of the present invention provide a genome haploid assembly method and device.
In a first aspect, an embodiment of the present invention provides a genome haploid assembly method, the method comprising:
Obtaining SNP information of a reference genome;
Obtaining phasing information for the SNP information using PacBio data and Hi-C data respectively;
Obtaining integrated phasing information for the whole genome according to the phasing information;
Distinguishing the haploid origin of the sequencing reads of the PacBio data according to the integrated phasing information, and performing haploid genome assembly separately.
In a second aspect, an embodiment of the present invention provides a genome haploid assembly device, the device comprising:
A first acquisition unit, configured to obtain SNP information of a reference genome;
A second acquisition unit, configured to obtain phasing information for the SNP information using PacBio data and Hi-C data respectively;
A third acquisition unit, configured to obtain integrated phasing information for the whole genome according to the phasing information;
An assembly unit, configured to distinguish the haploid origin of the sequencing reads of the PacBio data according to the integrated phasing information, and to perform haploid genome assembly separately.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising a processor, a memory, and a bus, wherein:
The processor and the memory communicate with each other through the bus;
The memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the following method:
Obtaining SNP information of a reference genome;
Obtaining phasing information for the SNP information using PacBio data and Hi-C data respectively;
Obtaining integrated phasing information for the whole genome according to the phasing information;
Distinguishing the haploid origin of the sequencing reads of the PacBio data according to the integrated phasing information, and performing haploid genome assembly separately.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, wherein:
The non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to perform the following method:
Obtaining SNP information of a reference genome;
Obtaining phasing information for the SNP information using PacBio data and Hi-C data respectively;
Obtaining integrated phasing information for the whole genome according to the phasing information;
Distinguishing the haploid origin of the sequencing reads of the PacBio data according to the integrated phasing information, and performing haploid genome assembly separately.
The genome haploid assembly method and device provided by the embodiments of the present invention use the integrated phasing information to distinguish the haploid origin of the sequencing reads of the PacBio data and perform haploid genome assembly separately, which achieves haploid assembly of the whole genome and improves the level of detail of the assembled genome sequence.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the genome haploid assembly method according to an embodiment of the present invention;
Fig. 2 shows the first phasing information of the PacBio data according to an embodiment of the present invention;
Fig. 3 shows the second phasing information of the Hi-C data according to an embodiment of the present invention;
Fig. 4 shows the chromosome-level haplotype result for the whole genome according to an embodiment of the present invention;
Fig. 5 shows the two sets of haplotypes obtained after the haplotypes are distinguished according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the genome haploid assembly device according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of the physical structure of the electronic device provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 is a schematic flowchart of the genome haploid assembly method according to an embodiment of the present invention. As shown in Fig. 1, the genome haploid assembly method provided by an embodiment of the present invention comprises the following steps:
S101: Obtain SNP information of a reference genome.
Specifically, the device obtains the SNP information of the reference genome. The device can be understood as the equipment that executes this method. This may be done as follows: align high-throughput sequencing data to the reference genome; then perform SNP calling on the reference genome based on the alignment result to obtain the SNP information. Further, bwa may be used to align the reads to the reference genome, and samtools may be used to obtain the SNP information of the reference genome.
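As a hedged illustration of how the SNP information from S101 might be consumed downstream, the sketch below filters VCF-style text records and keeps biallelic heterozygous SNPs, which are the sites that carry phase information. The VCF layout, the function name `heterozygous_snps`, and the sample records are assumptions made for illustration; the patent only states that bwa and samtools are used and does not specify a file format.

```python
from typing import Iterable, List, Tuple


def heterozygous_snps(vcf_lines: Iterable[str]) -> List[Tuple[str, int, str, str]]:
    """Return (chrom, pos, ref, alt) for biallelic heterozygous SNPs.

    Assumes single-sample, tab-separated VCF-style records (hypothetical input).
    """
    snps = []
    for line in vcf_lines:
        # Skip blank lines and header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no sample column, so no genotype to inspect
        chrom, pos, _id, ref, alt = fields[:5]
        # Keep single-base substitutions only (drops indels and multi-allelic sites).
        if len(ref) != 1 or len(alt) != 1 or alt == ".":
            continue
        format_keys = fields[8].split(":")
        sample_values = fields[9].split(":")
        if "GT" not in format_keys:
            continue
        genotype = sample_values[format_keys.index("GT")].replace("|", "/")
        if genotype in ("0/1", "1/0"):
            snps.append((chrom, int(pos), ref, alt))
    return snps
```

Homozygous calls are dropped because both homologous chromosomes carry the same allele there, so they cannot help tell the haplotypes apart.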
S102: Obtain phasing information for the SNP information using PacBio data and Hi-C data respectively.
Specifically, the device obtains the phasing information for the SNP information using the PacBio data and the Hi-C data respectively. This may be done as follows: align the PacBio data to the reference genome; then phase the SNP information using the first alignment information of the PacBio data to obtain the first phasing information of the PacBio data. Further, blasr may be used to align the PacBio data to the reference genome, and WhatsHap may be used with the first alignment information to perform SNP phasing on the reference genome.
Align the Hi-C data to the reference genome; then phase the SNP information using the second alignment information of the Hi-C data to obtain the second phasing information of the Hi-C data. Further, bwa may be used to align the Hi-C data to the reference genome, and HapCut2 may be used with the second alignment information to perform SNP phasing on the reference genome. Fig. 2 shows the first phasing information of the PacBio data and Fig. 3 shows the second phasing information of the Hi-C data. As shown in Figs. 2 and 3, the result obtained by phasing SNPs with PacBio data covers SNPs more continuously, but the haplotype blocks are smaller; the result obtained by phasing SNPs with Hi-C data spans a larger range, but the phased SNPs in between are sparser.
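The contrast between the two phasing results (dense but short PacBio blocks, long but sparse Hi-C blocks) can be made concrete with a small sketch. The input representation, a list of `(position, block_id)` pairs, and the names `block_summary` and `snp_density` are assumptions for illustration and are not the output format of WhatsHap or HapCut2.

```python
from typing import Dict, Iterable, Tuple


def block_summary(phased: Iterable[Tuple[int, str]]) -> Dict[str, Tuple[int, int, int]]:
    """Map each haplotype block to (start, end, number of phased SNPs)."""
    summary: Dict[str, Tuple[int, int, int]] = {}
    for pos, block in phased:
        if block in summary:
            start, end, count = summary[block]
            summary[block] = (min(start, pos), max(end, pos), count + 1)
        else:
            summary[block] = (pos, pos, 1)
    return summary


def snp_density(entry: Tuple[int, int, int]) -> float:
    """Phased SNPs per base across the span of one block."""
    start, end, count = entry
    return count / (end - start + 1)


# Hypothetical toy data: PacBio gives two short, dense blocks;
# Hi-C gives one long block that phases only a few sites.
pacbio = [(100, "p1"), (150, "p1"), (200, "p1"), (5000, "p2"), (5050, "p2")]
hic = [(100, "h1"), (5050, "h1")]
pacbio_blocks = block_summary(pacbio)
hic_blocks = block_summary(hic)
```

In this toy input the Hi-C block spans both PacBio blocks, which is what makes the integration in the next step possible.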
S103: Obtain integrated phasing information for the whole genome according to the phasing information.
Specifically, the device obtains the integrated phasing information for the whole genome according to the phasing information. This may be done as follows: connect the first phasing information of the PacBio data using the second phasing information of the Hi-C data; then correct errors in the first phasing information using the second phasing information to obtain the integrated phasing information. Fig. 4 shows the chromosome-level haplotype result for the whole genome. As shown in Fig. 4, each chromosome has one haplotype block that spans the whole chromosome and contains most of the SNPs on that chromosome. For example, the block on Lachesis_group0 from position 196 to 47,588,925 contains 252,046 variant sites, and the block on Lachesis_group1 from position 1,784 to 41,365,984 contains 178,242 variant sites. The remaining blocks are all small blocks made up of individual sites.
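The patent does not spell out how the Hi-C phasing connects and corrects the PacBio blocks, so the following is only one plausible sketch under stated assumptions: each PacBio block is oriented by a majority vote against the Hi-C calls at shared SNPs, and wherever Hi-C phased a site its call is taken as the correction. The data layout (dictionaries of position to haplotype 0 or 1) and the function name `integrate_phasing` are hypothetical.

```python
from typing import Dict


def integrate_phasing(pacbio_blocks: Dict[str, Dict[int, int]],
                      hic_phase: Dict[int, int]) -> Dict[int, int]:
    """Merge block-local PacBio phasing into one chromosome-wide phasing.

    pacbio_blocks: block id -> {SNP position: haplotype 0/1 within the block}
    hic_phase:     SNP position -> haplotype 0/1 at chromosome scale
    """
    integrated: Dict[int, int] = {}
    for block in pacbio_blocks.values():
        shared = [pos for pos in block if pos in hic_phase]
        agree = sum(1 for pos in shared if block[pos] == hic_phase[pos])
        disagree = len(shared) - agree
        if agree == disagree:
            # No shared SNPs, or a tie: the block cannot be oriented (assumption: skip it).
            continue
        flip = disagree > agree
        for pos, hap in block.items():
            integrated[pos] = 1 - hap if flip else hap
    # Assumed error-correction rule: where Hi-C phased a site, its call wins.
    integrated.update(hic_phase)
    return integrated
```

A block with no Hi-C support stays out of the chromosome-wide result in this sketch, which mirrors the small leftover blocks described for Fig. 4.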
S104: Distinguish the haploid origin of the sequencing reads of the PacBio data according to the integrated phasing information, and perform haploid genome assembly separately.
Specifically, the device distinguishes the haploid origin of the sequencing reads of the PacBio data according to the integrated phasing information, and performs haploid genome assembly separately. This may be done as follows: according to the integrated phasing information, determine the haplotype to which each sequencing read of the PacBio data belongs. Further, obtain all SNPs in each sequencing read, where each SNP corresponds to a designated haplotype; count the SNPs for every designated haplotype, and take as the read's haplotype the designated haplotype whose SNP count, as a fraction of the total number of SNPs in the read, is greater than a preset ratio. The preset ratio can be set as needed and is optionally 0.7. For example, suppose a sequencing read has four SNPs, SNPA, SNPB, SNPC, and SNPD, where the designated haplotype of SNPA, SNPB, and SNPC is type 1 and the designated haplotype of SNPD is type 2. The SNP count for type 1 is 3 and the SNP count for type 2 is 1. The ratio for type 1 is 0.75, which is greater than 0.7, so type 1 is taken as the haplotype to which this sequencing read belongs.
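The ratio rule above can be sketched in a few lines. The function name `assign_haplotype` and the choice to return `None` for reads that do not pass the threshold are assumptions for illustration; the 0.7 default and the strict "greater than" comparison follow the text.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def assign_haplotype(snp_haplotypes: Sequence[Hashable],
                     min_ratio: float = 0.7) -> Optional[Hashable]:
    """Return the haplotype a read belongs to, or None if no haplotype dominates.

    snp_haplotypes holds the designated haplotype of every SNP on the read.
    """
    if not snp_haplotypes:
        return None  # a read without SNPs carries no phase evidence
    haplotype, count = Counter(snp_haplotypes).most_common(1)[0]
    # Strictly greater than the preset ratio, as in the description.
    if count / len(snp_haplotypes) > min_ratio:
        return haplotype
    return None
```

With the worked example from the text, `assign_haplotype([1, 1, 1, 2])` gives type 1 because 3/4 = 0.75 exceeds 0.7, while an evenly split read such as `[1, 1, 2, 2]` is left unassigned.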
Performing haploid genome assembly separately may specifically include: according to the haplotype to which each sequencing read of the PacBio data belongs, performing haploid genome assembly separately on all the sequencing reads of the PacBio data. Continuing the example above, this sequencing read is assembled into the type 1 haplotype; all sequencing reads of the PacBio data are traversed in the same way, and each is assembled into its corresponding designated haplotype. Fig. 5 shows the two sets of haplotypes obtained after the haplotypes are distinguished. As shown in Fig. 5, the SNPs on each read come essentially from the same haplotype, with errors at a few sites; using a ratio of 0.7 distinguishes the haploid origin of most reads.
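Once every read has a haplotype label, the reads can be split into one input set per haplotype before assembly. The sketch below builds one FASTA text per haplotype; the function name `split_reads_to_fasta`, the in-memory dictionaries, and the decision to drop unassigned reads are assumptions for illustration, since the patent does not say how unassigned reads are handled.

```python
from typing import Dict, List, Optional


def split_reads_to_fasta(reads: Dict[str, str],
                         read_haplotype: Dict[str, Optional[int]]) -> Dict[int, str]:
    """Group reads by haplotype and render each group as FASTA text.

    reads:          read name -> sequence
    read_haplotype: read name -> haplotype label, or None if unassigned
    """
    records: Dict[int, List[str]] = {}
    for name, sequence in reads.items():
        haplotype = read_haplotype.get(name)
        if haplotype is None:
            continue  # assumption: reads without a confident haplotype are left out
        records.setdefault(haplotype, []).append(f">{name}\n{sequence}\n")
    return {haplotype: "".join(chunks) for haplotype, chunks in records.items()}
```

Each returned string could then be written to its own file and passed to the assembler as a separate run, one per haplotype.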
The two sets of reads are assembled separately, chromosome by chromosome, using canu. The assembly results for the two haplotypes are shown in Table 1.
Table 1
Type  CtgNum  CtgLen       N50      N90     CtgMax     GC (%)
Hap1  2,020   367,352,803  379,142  73,992  3,086,929  43.43
Hap2  2,028   372,970,663  375,986  76,163  2,746,366  43.40
(CtgNum: number of contigs; CtgLen: total contig length; CtgMax: length of the longest contig.)
The results show that the two haplotype assemblies differ little in genome size and contiguity, indicating that the partitioning of reads is not biased.
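For readers unfamiliar with the N50 and N90 columns, the sketch below computes them under the usual definition: the length of the contig at which the running sum of contig lengths, sorted from longest to shortest, first reaches 50% or 90% of the total. The function name `nxx` and the toy lengths are illustrative assumptions; the patent does not state which tool produced the values in Table 1.

```python
from typing import Sequence


def nxx(lengths: Sequence[int], fraction: float = 0.5) -> int:
    """Return the Nxx statistic (N50 by default) for a set of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    threshold = sum(lengths) * fraction
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return min(lengths)  # only reachable through floating-point rounding


# Hypothetical toy assembly with six contigs.
contigs = [80, 70, 50, 40, 30, 20]
n50 = nxx(contigs)        # 80 + 70 = 150 reaches half of 290
n90 = nxx(contigs, 0.9)   # 80 + 70 + 50 + 40 + 30 = 270 reaches 261
```

Similar N50 and N90 values for Hap1 and Hap2, as in Table 1, suggest that neither haplotype received a systematically poorer share of the reads.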
To assess the accuracy of the results, second-generation sequencing was performed on the two parents of the assembled individual, and SNP calling against the reference genome yielded each parent's SNPs. The SNPs obtained by phasing were compared with the parental SNPs to obtain the accuracy of the phased SNPs. In addition, the assembled haplotypes were aligned to the reference genome to obtain the SNPs between each haploid genome and the reference genome; each haploid genome's SNPs were then checked for consistency with the parental SNPs to obtain the accuracy of the haploid genome. The results are shown in Table 2:
Table 2
Here, the switch error is calculated as follows: if, of two adjacent SNPs, the first comes from parent a and the second comes from parent b, this counts as one error; the switch error is the ratio of the number of such erroneous SNPs to the total number of SNPs. Hap ratio is the proportion of the total SNPs that come from one parent. SNP is the accuracy of the SNPs after phasing; Contig is the accuracy of the assembled haploid genome sequence. Table 2 shows that the haplotypes assembled by this method are highly accurate.
The genome haploid assembly method provided by the embodiment of the present invention uses the integrated phasing information to distinguish the haploid origin of the sequencing reads of the PacBio data and performs haploid genome assembly separately, which achieves haploid assembly of the whole genome and improves the level of detail of the assembled genome sequence.
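The two metrics defined above can be sketched as follows. The input is assumed to be the parental origin of each SNP along one assembled haplotype, in genomic order; the names `switch_error_rate` and `hap_ratio` and the toy data are hypothetical. The sketch counts a change of parent in either direction between adjacent SNPs as one error, which is an interpretation of the definition in the text.

```python
from typing import Sequence


def switch_error_rate(parent_of_snp: Sequence[str]) -> float:
    """Number of parent switches between adjacent SNPs, divided by the total SNP count."""
    if not parent_of_snp:
        raise ValueError("at least one SNP is required")
    switches = sum(
        1 for previous, current in zip(parent_of_snp, parent_of_snp[1:])
        if previous != current
    )
    return switches / len(parent_of_snp)


def hap_ratio(parent_of_snp: Sequence[str], parent: str) -> float:
    """Fraction of SNPs on the haplotype that come from the given parent."""
    if not parent_of_snp:
        raise ValueError("at least one SNP is required")
    return sum(1 for p in parent_of_snp if p == parent) / len(parent_of_snp)


# Hypothetical haplotype: ten SNPs, one of which is attributed to the other parent.
origins = ["a", "a", "a", "b", "a", "a", "a", "a", "a", "a"]
rate = switch_error_rate(origins)   # a->b and b->a: 2 switches over 10 SNPs
ratio = hap_ratio(origins, "a")     # 9 of 10 SNPs come from parent a
```

A low switch error together with a Hap ratio close to 1 indicates that an assembled haplotype follows a single parent almost end to end.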
On the basis of the above embodiment, obtaining the SNP information of the reference genome comprises:
Aligning high-throughput sequencing data to the reference genome.
Specifically, the device aligns the high-throughput sequencing data to the reference genome. See the above embodiment; details are not repeated here.
Performing SNP calling on the reference genome based on the alignment result to obtain the SNP information.
Specifically, the device performs SNP calling on the reference genome based on the alignment result to obtain the SNP information. See the above embodiment; details are not repeated here.
The genome haploid assembly method provided by the embodiment of the present invention can effectively obtain the SNP information, and can thereby achieve haploid assembly of the whole genome and improve the level of detail of the assembled genome sequence.
On the basis of the above embodiment, obtaining the phasing information for the SNP information using PacBio data and Hi-C data respectively comprises:
Aligning the PacBio data to the reference genome.
Specifically, the device aligns the PacBio data to the reference genome. See the above embodiment; details are not repeated here.
Phasing the SNP information using the first alignment information of the PacBio data to obtain the first phasing information of the PacBio data.
Specifically, the device phases the SNP information using the first alignment information of the PacBio data to obtain the first phasing information of the PacBio data. See the above embodiment; details are not repeated here.
Aligning the Hi-C data to the reference genome.
Specifically, the device aligns the Hi-C data to the reference genome. See the above embodiment; details are not repeated here.
Phasing the SNP information using the second alignment information of the Hi-C data to obtain the second phasing information of the Hi-C data.
Specifically, the device phases the SNP information using the second alignment information of the Hi-C data to obtain the second phasing information of the Hi-C data. See the above embodiment; details are not repeated here.
The genome haploid assembly method provided by the embodiment of the present invention can effectively obtain the phasing information for the SNP information, and can thereby achieve haploid assembly of the whole genome and improve the level of detail of the assembled genome sequence.
On the basis of the above embodiment, obtaining the integrated phasing information for the whole genome according to the phasing information comprises:
Connecting the first phasing information of the PacBio data using the second phasing information of the Hi-C data.
Specifically, the device connects the first phasing information of the PacBio data using the second phasing information of the Hi-C data. See the above embodiment; details are not repeated here.
Correcting errors in the first phasing information using the second phasing information to obtain the integrated phasing information.
Specifically, the device corrects errors in the first phasing information using the second phasing information to obtain the integrated phasing information. See the above embodiment; details are not repeated here.
The genome haploid assembly method provided by the embodiment of the present invention can effectively obtain the integrated phasing information for the whole genome, and can thereby achieve haploid assembly of the whole genome and improve the level of detail of the assembled genome sequence.
On the basis of the above embodiment, distinguishing the haploid origin of the sequencing reads of the PacBio data according to the integrated phasing information comprises:
Determining, according to the integrated phasing information, the haplotype to which each sequencing read of the PacBio data belongs.
Specifically, the device determines, according to the integrated phasing information, the haplotype to which each sequencing read of the PacBio data belongs. See the above embodiment; details are not repeated here.
By determining the haplotype to which each sequencing read of the PacBio data belongs, the genome haploid assembly method provided by the embodiment of the present invention can effectively distinguish the haploid origin of the sequencing reads of the PacBio data.
On the basis of the above embodiment, determining, according to the integrated phasing information, the haplotype to which each sequencing read of the PacBio data belongs comprises:
Obtaining all SNPs in each sequencing read, where each SNP corresponds to a designated haplotype.
Specifically, the device obtains all SNPs in each sequencing read, where each SNP corresponds to a designated haplotype. See the above embodiment; details are not repeated here.
Counting the SNPs for every designated haplotype, and taking as the read's haplotype the designated haplotype whose SNP count, as a fraction of the total number of SNPs, is greater than a preset ratio.
Specifically, the device counts the SNPs for every designated haplotype, and takes as the read's haplotype the designated haplotype whose SNP count, as a fraction of the total number of SNPs, is greater than the preset ratio. See the above embodiment; details are not repeated here.
The genome haploid assembly method provided by the embodiment of the present invention can thereby accurately and reasonably determine the haplotype to which each sequencing read of the PacBio data belongs.
On the basis of the above embodiment, performing haploid genome assembly separately comprises:
Performing haploid genome assembly separately on all the sequencing reads of the PacBio data according to the haplotype to which each sequencing read of the PacBio data belongs.
Specifically, the device performs haploid genome assembly separately on all the sequencing reads of the PacBio data according to the haplotype to which each sequencing read belongs. See the above embodiment; details are not repeated here.
By performing haploid genome assembly separately on all the sequencing reads of the PacBio data, the genome haploid assembly method provided by the embodiment of the present invention can achieve haploid assembly of the whole genome and improve the level of detail of the assembled genome sequence.
Fig. 6 is a schematic structural diagram of the genome haploid assembly device according to an embodiment of the present invention. As shown in Fig. 6, an embodiment of the present invention provides a genome haploid assembly device comprising a first acquisition unit 601, a second acquisition unit 602, a third acquisition unit 603, and an assembly unit 604, wherein:
The first acquisition unit 601 is configured to obtain SNP information of a reference genome; the second acquisition unit 602 is configured to obtain phasing information for the SNP information using PacBio data and Hi-C data respectively; the third acquisition unit 603 is configured to obtain integrated phasing information for the whole genome according to the phasing information; and the assembly unit 604 is configured to distinguish the haploid origin of the sequencing reads of the PacBio data according to the integrated phasing information, and to perform haploid genome assembly separately.
Specifically, the first acquisition unit 601 is configured to obtain SNP information of a reference genome; the second acquisition unit 602 is configured to obtain phasing information for the SNP information using PacBio data and Hi-C data respectively; the third acquisition unit 603 is configured to obtain integrated phasing information for the whole genome according to the phasing information; and the assembly unit 604 is configured to distinguish the haploid origin of the sequencing reads of the PacBio data according to the integrated phasing information, and to perform haploid genome assembly separately.
The genome haploid assembly device provided by the embodiment of the present invention uses the integrated phasing information to distinguish the haploid origin of the sequencing reads of the PacBio data and performs haploid genome assembly separately, which achieves haploid assembly of the whole genome and improves the level of detail of the assembled genome sequence.
The genome haploid assembly device provided by the embodiment of the present invention can be used to execute the processing flows of the above method embodiments. Its functions are not described again here; see the detailed description of the above method embodiments.
Fig. 7 is a schematic diagram of the physical structure of the electronic device provided by an embodiment of the present invention. As shown in Fig. 7, the electronic device comprises a processor 701, a memory 702, and a bus 703;
The processor 701 and the memory 702 communicate with each other through the bus 703;
The processor 701 is configured to call the program instructions in the memory 702 to execute the methods provided by the above method embodiments, for example: obtaining SNP information of a reference genome; obtaining phasing information for the SNP information using PacBio data and Hi-C data respectively; obtaining integrated phasing information for the whole genome according to the phasing information; and distinguishing the haploid origin of the sequencing reads of the PacBio data according to the integrated phasing information, and performing haploid genome assembly separately.
This embodiment discloses a computer program product. The computer program product comprises a computer program stored on a non-transitory computer-readable storage medium, and the computer program comprises program instructions. When the program instructions are executed by a computer, the computer can execute the methods provided by the above method embodiments, for example: obtaining SNP information of a reference genome; obtaining phasing information for the SNP information using PacBio data and Hi-C data respectively; obtaining integrated phasing information for the whole genome according to the phasing information; and distinguishing the haploid origin of the sequencing reads of the PacBio data according to the integrated phasing information, and performing haploid genome assembly separately.
This embodiment provides a non-transitory computer-readable storage medium. The non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to execute the methods provided by the above method embodiments, for example: obtaining SNP information of a reference genome; obtaining phasing information for the SNP information using PacBio data and Hi-C data respectively; obtaining integrated phasing information for the whole genome according to the phasing information; and distinguishing the haploid origin of the sequencing reads of the PacBio data according to the integrated phasing information, and performing haploid genome assembly separately.
Those of ordinary skill in the art will appreciate that: realize that all or part of the steps of above method embodiment can pass through The relevant hardware of program instruction is completed, and program above-mentioned can be stored in a computer readable storage medium, the program When being executed, step including the steps of the foregoing method embodiments is executed;And storage medium above-mentioned includes: ROM, RAM, magnetic disk or light The various media that can store program code such as disk.
The embodiments such as electronic equipment described above are only schematical, wherein it is described as illustrated by the separation member Unit may or may not be physically separated, and component shown as a unit may or may not be object Manage unit, it can it is in one place, or may be distributed over multiple network units.It can select according to the actual needs Some or all of the modules therein is selected to achieve the purpose of the solution of this embodiment.Those of ordinary skill in the art are not paying wound In the case where the labour for the property made, it can understand and implement.
From the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software together with a necessary general-purpose hardware platform, or, of course, by hardware. On this understanding, the essence of the above technical solution, or the part of it that contributes over the prior art, may be embodied as a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments only illustrate the technical solutions of the embodiments of the present invention and do not limit them. Although the embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A haploid genome assembly method, characterized by comprising:
obtaining SNP information of a reference genome;
obtaining phasing information of the SNP information using PacBio data and Hi-C data, respectively;
obtaining integrated whole-genome phasing information according to the phasing information;
distinguishing the haplotype source of the sequencing fragments of the PacBio data according to the integrated phasing information, and performing haploid genome assembly separately for each.
2. The method according to claim 1, characterized in that obtaining the SNP information of the reference genome comprises:
aligning high-throughput sequencing data to the reference genome;
performing SNP calling against the reference genome based on the alignment results, to obtain the SNP information.
3. The method according to claim 1, characterized in that obtaining the phasing information of the SNP information using PacBio data and Hi-C data, respectively, comprises:
aligning the PacBio data to the reference genome;
phasing the SNP information using first alignment information of the PacBio data, to obtain first phasing information of the PacBio data;
aligning the Hi-C data to the reference genome;
phasing the SNP information using second alignment information of the Hi-C data, to obtain second phasing information of the Hi-C data.
4. The method according to claim 1, characterized in that obtaining the integrated whole-genome phasing information according to the phasing information comprises:
linking the first phasing information of the PacBio data by means of the second phasing information of the Hi-C data;
correcting errors in the first phasing information by means of the second phasing information, to obtain the integrated phasing information.
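
For illustration only, the linking step of claim 4 might be sketched as follows. The data layout is an assumption and does not come from the patent: each PacBio phase block is a dict mapping SNP position to 0 or 1 (the haplotype carrying the reference allele, with an arbitrary orientation per block), and the Hi-C phasing is a sparser dict in a single chromosome-wide orientation. The function name `integrate_phasing` and the majority-vote orientation rule are hypothetical. Positions where the oriented PacBio call still disagrees with Hi-C are only reported as candidates for correction, because this section does not state how such conflicts are resolved.

```python
def integrate_phasing(pacbio_blocks, hic_phase):
    """Orient PacBio phase blocks against Hi-C phasing and merge them.

    pacbio_blocks: list of dicts {snp_position: 0 or 1}, one per block
                   (hypothetical layout; block orientation is arbitrary).
    hic_phase:     dict {snp_position: 0 or 1} in one global orientation.
    Returns (merged_phase, conflict_positions).
    """
    merged = {}
    conflicts = []
    for block in pacbio_blocks:
        # SNPs phased by both data types let us compare orientations.
        shared = [pos for pos in block if pos in hic_phase]
        agree = sum(block[pos] == hic_phase[pos] for pos in shared)
        # Flip the block when most shared SNPs disagree with Hi-C.
        # Blocks with no shared SNPs cannot be oriented and are left as-is.
        flip = (len(shared) - agree) > agree
        for pos, hap in block.items():
            oriented = 1 - hap if flip else hap
            merged[pos] = oriented
            # Remaining disagreements are candidates for error correction.
            if pos in hic_phase and hic_phase[pos] != oriented:
                conflicts.append(pos)
    return merged, sorted(conflicts)


blocks = [{100: 0, 200: 1, 300: 0}, {1000: 1, 1100: 0, 1200: 1}]
hic = {100: 0, 300: 0, 1000: 0, 1100: 1, 1200: 1}
merged, conflicts = integrate_phasing(blocks, hic)
print(merged)
print(conflicts)
```

In this toy input the second block is flipped because two of its three shared SNPs disagree with the Hi-C orientation, and the one SNP that still disagrees after flipping is reported as a conflict.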
5. The method according to any one of claims 1 to 4, characterized in that distinguishing the haplotype source of the sequencing fragments of the PacBio data according to the integrated phasing information comprises:
determining, according to the integrated phasing information, the haplotype to which each sequencing fragment of the PacBio data belongs.
6. The method according to claim 5, characterized in that determining, according to the integrated phasing information, the haplotype to which each sequencing fragment of the PacBio data belongs comprises:
obtaining all SNPs in each sequencing fragment, each SNP corresponding to a specified haplotype;
obtaining the number of SNPs corresponding to each specified haplotype, and taking, as the haplotype to which the fragment belongs, the specified haplotype whose SNP count, as a ratio of the total number of all SNPs, is greater than a preset ratio.
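
The ratio test of claim 6 can be sketched as below. This is a minimal illustration under stated assumptions, not the patented implementation: the names `assign_haplotype`, `read_alleles`, `phased_snps`, and the default `min_ratio=0.8` are hypothetical, and the sketch counts only SNPs whose observed base matches one of the two phased alleles.

```python
from collections import Counter


def assign_haplotype(read_alleles, phased_snps, min_ratio=0.8):
    """Assign one sequencing fragment to a haplotype by SNP ratio.

    read_alleles: dict {snp_position: base observed in the fragment}.
    phased_snps:  dict {snp_position: (allele_on_H1, allele_on_H2)},
                  taken from the integrated phasing (hypothetical layout).
    min_ratio:    preset ratio; assumed to be at least 0.5 so that at most
                  one haplotype can pass the test.
    Returns "H1", "H2", or None when no haplotype exceeds the ratio.
    """
    counts = Counter()
    for pos, base in read_alleles.items():
        alleles = phased_snps.get(pos)
        if alleles is None:
            continue  # position is not a phased SNP
        if base == alleles[0]:
            counts["H1"] += 1
        elif base == alleles[1]:
            counts["H2"] += 1
        # A base matching neither allele is ignored in this sketch.
    total = sum(counts.values())
    if total == 0:
        return None
    for hap, n in counts.items():
        if n / total > min_ratio:
            return hap
    return None


phased = {10: ("A", "G"), 20: ("C", "T"), 30: ("G", "A"),
          40: ("T", "C"), 50: ("G", "A")}
read = {10: "A", 20: "C", 30: "G", 40: "T", 50: "A"}
print(assign_haplotype(read, phased, min_ratio=0.7))
```

Here four of the five informative SNPs support H1, a ratio of 0.8, so the fragment is assigned to H1 at a preset ratio of 0.7 but stays unassigned at 0.8, because the claim requires the ratio to be strictly greater than the preset value.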
7. The method according to claim 6, characterized in that performing haploid genome assembly separately comprises:
performing haploid genome assembly separately on all sequencing fragments of the PacBio data, according to the haplotype to which each sequencing fragment of the PacBio data belongs.
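
As a rough sketch of how the fragments might be separated before assembly, the code below bins fragments by their assigned haplotype and renders each bin as FASTA text that could be handed to a long-read assembler of choice. The function names `partition_reads` and `to_fasta` are hypothetical, and keeping unassigned fragments in their own bin is an assumption; this section does not say how such fragments are handled.

```python
from collections import defaultdict


def partition_reads(reads, read_haplotype):
    """Group fragments by assigned haplotype.

    reads:          dict {read_id: sequence}.
    read_haplotype: dict {read_id: "H1", "H2", or None}.
    Fragments with no assignment go into an "unassigned" bin (assumption).
    """
    bins = defaultdict(dict)
    for read_id, seq in reads.items():
        hap = read_haplotype.get(read_id)
        key = hap if hap is not None else "unassigned"
        bins[key][read_id] = seq
    return dict(bins)


def to_fasta(records):
    """Render {read_id: sequence} as FASTA text, one record per fragment."""
    return "".join(f">{rid}\n{seq}\n" for rid, seq in records.items())


reads = {"r1": "ACGT", "r2": "GGCC", "r3": "TTAA"}
assignment = {"r1": "H1", "r2": "H2"}
bins = partition_reads(reads, assignment)
for hap in sorted(bins):
    print(hap, list(bins[hap]))
```

Each haplotype bin would then be assembled independently, which is what yields one assembly per haplotype.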
8. A haploid genome assembly device, characterized by comprising:
a first acquisition unit, configured to obtain SNP information of a reference genome;
a second acquisition unit, configured to obtain phasing information of the SNP information using PacBio data and Hi-C data, respectively;
a third acquisition unit, configured to obtain integrated whole-genome phasing information according to the phasing information;
an assembly unit, configured to distinguish the haplotype source of the sequencing fragments of the PacBio data according to the integrated phasing information, and to perform haploid genome assembly separately for each.
9. An electronic device, characterized by comprising a processor, a memory, and a bus, wherein:
the processor and the memory communicate with each other via the bus;
the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions that cause a computer to perform the method according to any one of claims 1 to 7.
CN201811069322.6A 2018-09-13 2018-09-13 Genome haploid assembling method and device Active CN109273052B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811069322.6A CN109273052B (en) 2018-09-13 2018-09-13 Genome haploid assembling method and device

Publications (2)

Publication Number Publication Date
CN109273052A true CN109273052A (en) 2019-01-25
CN109273052B CN109273052B (en) 2022-03-18

Family

ID=65188628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811069322.6A Active CN109273052B (en) 2018-09-13 2018-09-13 Genome haploid assembling method and device

Country Status (1)

Country Link
CN (1) CN109273052B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368705A (en) * 2011-04-14 2017-11-21 考利达基因组股份有限公司 The processing and analysis of complex nucleic acid sequence data
CN102224801A (en) * 2011-04-19 2011-10-26 江苏省农业科学院 Rapid multi-target property polymerization breeding method for rape
CN104736722A (en) * 2012-05-21 2015-06-24 斯克利普斯研究所 Methods of sample preparation
CN108486236A (en) * 2012-07-18 2018-09-04 伊鲁米纳剑桥有限公司 Method and system for determining haplotype He determining phase haplotype
US20160168632A1 (en) * 2013-08-02 2016-06-16 Stc Unm Dna sequencing and epigenome analysis
CN105637099A (en) * 2013-08-23 2016-06-01 考利达基因组股份有限公司 Long fragment de novo assembly using short reads
CN107533590A (en) * 2015-02-17 2018-01-02 多弗泰尔基因组学有限责任公司 Nucleotide sequence assembles
US20170235876A1 (en) * 2016-02-11 2017-08-17 10X Genomics, Inc. Systems, methods, and media for de novo assembly of whole genome sequence data
CN107419000A (en) * 2016-05-24 2017-12-01 中国农业科学院作物科学研究所 A kind of full genome system of selection and its application that prediction Soybean Agronomic Characters phenotype is sampled based on haplotype
WO2018035070A2 (en) * 2016-08-16 2018-02-22 Monsanto Technology Llc Compositions and methods for plant haploid induction

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DEREK M. BICKHART et al.: "Single-molecule sequencing and conformational capture enable de novo mammalian reference genomes", 《BIORXIV》 *
PETER EDGE et al.: "HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies", 《GENOME RESEARCH》 *
S. A. LAPP et al.: "PacBio assembly of a Plasmodium knowlesi genome sequence with Hi-C correction and manual annotation of the SICAvar gene family", 《SPECIAL ISSUE ARTICLE》 *
ZEV N. KRONENBERG et al.: "FALCON-Phase: Integrating PacBio and Hi-C data for phased diploid genomes", 《BIORXIV》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816248A (en) * 2020-05-22 2020-10-23 武汉菲沙基因信息有限公司 Complete genome typing method based on Pacbio libraries and Hi-C reads
CN111816248B (en) * 2020-05-22 2023-12-01 武汉菲沙基因信息有限公司 Pacbio surassemblies and Hi-C reads-based whole genome typing method
CN112908415A (en) * 2021-02-23 2021-06-04 广西壮族自治区农业科学院 Method for obtaining more accurate chromosome level genome
CN112908415B (en) * 2021-02-23 2022-05-17 广西壮族自治区农业科学院 Method for obtaining chromosome level genome
CN113151426A (en) * 2021-04-16 2021-07-23 中国农业科学院兰州畜牧与兽药研究所 Method for assembling and annotating Hobara sheep genome based on three-generation PacBio and Hi-C technology

Also Published As

Publication number Publication date
CN109273052B (en) 2022-03-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant