WO2015149719A1 - 杂合基因组处理方法 - Google Patents

杂合基因组处理方法 Download PDF

Info

Publication number
WO2015149719A1
WO2015149719A1 PCT/CN2015/075871 CN2015075871W WO2015149719A1 WO 2015149719 A1 WO2015149719 A1 WO 2015149719A1 CN 2015075871 W CN2015075871 W CN 2015075871W WO 2015149719 A1 WO2015149719 A1 WO 2015149719A1
Authority
WO
WIPO (PCT)
Prior art keywords
genome
hybrid
mer
processing method
species
Prior art date
Application number
PCT/CN2015/075871
Other languages
English (en)
French (fr)
Inventor
詹东亮
梁文颖
Original Assignee
深圳华大基因科技服务有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因科技服务有限公司 filed Critical 深圳华大基因科技服务有限公司
Publication of WO2015149719A1 publication Critical patent/WO2015149719A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20Sequence assembly
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Definitions

  • the invention relates to the field of genetic engineering technology and the field of genomic bioinformatics analysis.
  • the invention relates to hybrid genome processing to obtain a genome-wide sequence map of a species.
  • Whole-genome de novo sequencing is also called de novo sequencing.
  • a species can be sequenced without any reference sequence, and spliced and assembled by bioinformatics analysis to obtain a genome-wide sequence map of the species. Obtaining a genome-wide sequence of a species is an important shortcut to speeding up the understanding of this species.
  • the technical problem to be solved by the present invention is that the conventional hybrid genomic mapping scheme in the prior art has high cost and long period, which seriously restricts the defects of the interpretation of the genome of the species, and provides a hybrid genome processing method.
  • the technical solution adopted by the present invention to solve the technical problem thereof is to construct a hybrid genome processing method, which comprises:
  • Hybrid recognition processing is performed on the genomic map with redundant sequences to remove redundant Scaffolds in the hybrid region and retain heterozygous region information to obtain a genome-wide map.
  • the hybrid recognition process includes identifying the hybrid region based on a k-mer profile.
  • the identifying the heterozygous region based on the k-mer profile comprises:
  • the Unique k-mer represents a k-mer that appears only once in the genome, and k-mer represents a short sequence of a fixed length k, and k is a positive integer;
  • the heterozygous region is determined based on the Unique k-mer region.
  • the determining the heterozygous region according to the Unique k-mer region comprises:
  • N and M are positive integers
  • the two Scaffolds are determined as corresponding hybrid regions.
  • the redundant Scaffold in the hybrid region is removed based at least on the length of the assembly.
  • the contig group sequence Contig is also constructed.
  • the genome genome annotation and/or genome evolution analysis of the whole genome map is further included.
  • the genomic annotation includes a Repeat annotation, a gene prediction, a gene function annotation, and an ncRNA annotation;
  • the genomic evolution analysis includes gene family identification, species phylogenetic tree construction, species divergence time estimation, Genomic collinearity analysis and genome-wide replication analysis.
  • the genome-wide shotgun method is used to perform ultra-high-depth sequencing of the target species by at least 200 times to obtain an effective read length sequence Reads including:
  • the constructed library was subjected to ultra-high-depth sequencing by at least 200-fold using the genome-wide shotgun method to obtain the original Reads;
  • the original Reads are filtered to obtain valid Reads.
  • constructing a library includes constructing a small fragment library and constructing a large fragment library, wherein the small fragment library includes a plurality of insert libraries of different lengths, and the large fragment library includes a plurality of paired Mates having different lengths. Pair library.
  • constructing a small fragment library includes:
  • the 3' end of the DNA fragment repaired at the end is linked to the base A;
  • a PCR reaction and PCR amplification were performed on the DNA fragment to which the linker Adapter was attached, according to the length of the fragment of interest, to construct a small fragment library.
  • constructing a large fragment library includes:
  • the captured fragments were end-modified and ligated to a specific linker Adapter to construct a large fragment library.
  • the present invention does not require ultra-high-depth sequencing (at least 200 times) of target species based on a new generation high-throughput sequencing technology without any reference sequence; the cost is greatly reduced without constructing a BAC library; The technical solution of the present invention only takes 6 months, and the existing technical solution takes 1 year, so the cycle is greatly shortened; the progress of the hybrid genome research will be greatly promoted, and the rapid acquisition of the high-quality genome map of the species is provided. Guarantee.
  • FIG. 1 is a schematic flow diagram of a hybrid genome processing method according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a hybrid recognition process in accordance with an embodiment of the present invention.
  • FIG. 3 is a schematic illustration of a hybrid genome processing method in accordance with another embodiment of the present invention.
  • FIG. 4 is a flow diagram showing the construction of a small fragment library in accordance with an embodiment of the present invention.
  • FIG. 5 is a schematic flow chart of constructing a large fragment library according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of an information analysis process in a hybrid genome processing method according to an embodiment of the present invention.
  • FIG. 8 is a distribution diagram of GC content and sequencing depth in accordance with an embodiment of the present invention.
  • FIG. 9 is a schematic illustration of genomic GC content distribution for four species, in accordance with an embodiment of the present invention.
  • Figure 10 is a sequencing depth profile diagram in accordance with an embodiment of the present invention.
  • Figure 11 is a schematic illustration of the homology of a family of proteins of multiple species, in accordance with an embodiment of the present invention.
  • Figure 12 is a schematic illustration of a phylogenetic tree in accordance with an embodiment of the present invention.
  • Figure 13 is a schematic illustration of an estimate of differentiation time and replacement rate, in accordance with an embodiment of the present invention.
  • Figure 14 is a schematic illustration of the conserved sequence length relationship of three genomes of giant panda, dog and human according to an embodiment of the present invention.
  • Figure 15 is a schematic illustration of genomic collinearity of two species, in accordance with an embodiment of the present invention.
  • 16 is a colinear view of two species in accordance with an embodiment of the present invention.
  • 17 is a distribution diagram of large segment copy length and similarity according to an embodiment of the present invention.
  • FIG. 19 is another schematic diagram of the results of genomic self-linear alignment analysis according to an embodiment of the present invention.
  • 20 is a 4DTv profile of two species, in accordance with an embodiment of the present invention.
  • the ultra-high-depth sequencing of the target species (generally more than 200 times) is carried out based on the new generation high-throughput sequencing technology without any reference sequence, and then the new SOAPdenovo2 assembly software is adopted. , combined with hybrid recognition processing to obtain a genome-wide sequence map of the target species.
  • Embodiments of the invention may include genome-wide shotgun (WGS) sequencing, genome assembly, genome annotation and evolution analysis, and personalized analysis to address the challenges for hybrid genomes.
  • WGS genome-wide shotgun
  • the hybrid genome processing method comprises:
  • S101 using the whole genome shotgun method to perform at least 200 times ultra-high depth sequencing of the target species to obtain an effective read length sequence Reads;
  • the hybrid recognition process may include determining a unique k-mer by a k-mer (which represents a short sequence of fixed lengths of a specified value k), ie, only in the genome The k-mer appears once, in which the Unique k-mer comes from the Unique region in the sequence, which represents unique genetic information, which is important for splicing, and Repeat k-mer (repeated k-mer) may come from the sequence Repeating areas that may create potential misspellings.
  • the k-mer curve can be drawn using 17-mer
  • the distribution looks for the Unique k-mer in the genome. Probabilistically, >99% of the unique k-mers are distributed in the region before the 1.5-fold main peak, so the k-mer in this region is defined as the Unique k-mer (the inside of the rectangular box shown in Figure 2 is Unique K- Mer); the Scaffold sequences are arranged from large to small, and then the Unique k-mer is aligned back to the Scaffold sequence, and the area of the Unique k-mer enrichment is called the Unique area, such as A, A' in FIG.
  • B or C is only the other types of sequences in which the repeating sequence of the example genome exists (the area with more repeat k-mer), and the information of these regions is not obtained by sequence alignment; statistics are unique for each Scaffold.
  • the redundant Scaffold will be removed based at least on the assembly length. For example, a short Scaffold with poor assembly quality and short assembly length can be considered redundant and removed from the genome map. , retained in another document to eventually assemble into a complete high-quality genomic map, while retaining heterozygous region information for subsequent genome annotation and genome evolution analysis, as well as personalized analysis.
  • the genome annotations include Repeat annotations, gene predictions, gene function annotations, and ncRNA annotations; genomic evolution analysis includes gene family identification, species phylogenetic tree construction, species divergence time estimation, genomic collinearity analysis, and genome-wide replication analysis .
  • a hybrid genome processing method includes acquiring genomic DNA of a target species, constructing a WGS sequencing library for the genomic DNA, performing at least 200-fold ultra-high-depth sequencing, constructing Contig, and constructing Scaffold obtains a genomic map with redundant sequences, recognizes unique k-mer, and hybrid recognition processing to obtain a genome-wide map.
  • the genome-wide shotgun method is used to perform at least 200-fold ultra-high-depth sequencing of the target species to obtain an effective read length sequence Reads comprising: constructing a library; using the whole-genome shotgun method to perform at least 200 on the constructed library. Multiple ultra-high depth sequencing to obtain the original Reads; the original Reads are filtered to obtain valid Reads.
  • constructing a library comprises constructing a small fragment library and constructing a large fragment library, wherein the small fragment library comprises a plurality of insert libraries of different lengths, the large fragment library comprising a plurality of Mate Pair libraries of different lengths.
  • the small segment data refers to an insert length of less than 1 kb
  • the large segment data refers to an insert length greater than 1 kb.
  • the small fragment library can include a 170 bp, 500 bp, 800 bp insert library; the large fragment library can include a 2 kb, 5 kb, 10 kb, 20 kb, 40 kb pair (Mate Pair) library.
  • constructing a small fragment library includes:
  • S401 interrupting the DNA sample of the qualified target species (ie, fragmenting the DNA sample) to obtain a DNA fragment; wherein, DNA samples of the target species can be obtained by using Nanodrop, Gel-Electrophotometric, or the like.
  • the product is tested and the tested DNA sample is used for library preparation.
  • the DNA sample can be interrupted using ultrasound.
  • S402 End repair of the DNA fragment.
  • the ends can be repaired using T4 DNA polymerase and Klenow polymerase to prepare blunt ends.
  • S403 The 3' end of the DNA fragment repaired at the end is ligated to the base A.
  • the repair product obtained in the previous step is added to base 3 at the 3' end to prepare for the next linker Adapter connection.
  • a DNA fragment linked to the base A is ligated to the adaptor Adapte.
  • the T4 DNA ligase reaction system is configured and reacted in a Thermomixer for a certain period of time to link the adapter to the "A" product.
  • S405 DNA fragment size selection.
  • the corresponding agarose gel is prepared according to the size of the fragment of interest, and the ligation product of the previous step is subjected to a PCR reaction.
  • S406 PCR amplification and product purification.
  • the PCR reaction system is configured to amplify the cut product recovered in the previous step, wherein the PCR enzyme is purified by using Phusion DNA Polymerase and agarose gel to complete small fragment library construction.
  • constructing a large fragment library in accordance with the present invention includes:
  • S501 Randomly break the genomic DNA of the target species to a specific size to obtain a DNA fragment. For example, in one embodiment, a range of 2-10 kb can be selected.
  • the T4 DNA polymerase smoothes the 3' overhang and fills the 5' overhang.
  • the probe is labeled by a displacement synthesis method, and T4PNK phosphorylates the 5' end of the DNA to carry out a ligation reaction, and performs DNA end labeling, and removes the 3' phosphate group with T4 polynucleic acid kinase.
  • Klenow DNA polymerase can fill the 5' overhanging end to form a blunt end; the 3' overhang is cut to form a blunt end. The use of these three enzymes can achieve the purpose of DNA fragment repair and biotin labeling.
  • S503 DNA fragment size selection to obtain a DNA fragment of interest.
  • S504 Cycling the DNA fragment of interest.
  • an appropriate amount of the recovered DNA is subjected to a cyclized enzymatic reaction;
  • S505 Digestion of linear DNA, wherein the cyclized DNA sample is added to a digestion reaction system to digest linear DNA.
  • S506 Breaking the circularized DNA fragment of interest into a fragment of a predetermined length and capturing the fragment bearing the biotin label.
  • the cyclized DNA is disrupted and purified: the cyclized DNA molecule is interrupted into a 400-600 bp fragment and the biotinylated label is labeled by magnetic beads with streptavidin Fragment capture.
  • the overall sequencing depth for both small and large fragment libraries is at least 200 fold.
  • the small fragment library sequencing strategy is 101PE; the large fragment library sequencing strategy is 50PE or 91PE.
  • Take Illumina Hiseq2000 Sequencing provides high genomic single-base sequencing depth and genomic physical coverage, resulting in high single-base accuracy and ensuring genome integrity.
  • the original Reads obtained by sequencing are not all effective. They contain plug-in, repeat, and low-quality Reads. These Reads will affect the assembly and subsequent analysis.
  • the original Reads must be filtered to obtain valid Reads. .
  • the filtered Reads method is as follows:
  • (L-k+1) sequences of length k can be obtained for a sequence of length L.
  • all k-mers taken from bases in Reads can traverse the entire genome, and the frequency at which k-mers appear is subject to Poisson distribution. Sequencing errors can lead to new k-mers, and in general these new k-mer frequencies are relatively low.
  • the amount of sequencing is large enough, it can be considered that the k-mer with low frequency is caused by sequencing errors. Because the large segment is in the process of cyclization when building the library, and only plays a role in the assembly process, error correction is only performed for small segment data.
  • the k-mers on the entire Reads are high-frequency, so it is considered that some errors caused by sequencing are corrected.
  • base depth analysis was performed. The aligned Reads is compared to the spliced genomic sequence using the alignment program SOAPaligner, and then the number of times each base is covered is counted based on the alignment result, so that the percentage of bases of various sequencing depths to the whole genome can be obtained.
  • Huada Gene has independently developed short sequence assembly analysis software (SOAP software) for the characteristics of Illumina Hiseq2000 technology, and has been successfully applied to the assembly of hundreds of genomes such as cucumber, panda, cabbage, ant, potato, millet, yak, oyster, etc. It has accumulated rich experience in genome assembly and can efficiently and accurately splicing and assembling massive data obtained by sequencing Hiseq2000.
  • Soapdenovo has been upgraded to Soapdenovo2, which has many functional improvements and better assembly results than the previous version.
  • Soapdenovo2 has the optional multi-k-mer function, and the multi-k-mer is a new feature of soapdenovo2.
  • Gradient k-mers can be used in Contig builds, which are better than previous versions of a single k-mer assembly.
  • Genomic assembly involves assembly using SOAPdenovo2 assembly software, GC-depth analysis, GC content analysis, sequencing depth analysis, genome assembly result evaluation (autosomal region coverage assessment and gene region coverage assessment).
  • Genomic annotations refer to Repeat annotations, gene predictions, gene function annotations, and ncRNA annotations.
  • Genomic evolution analysis involves gene family identification, species phylogenetic tree construction, species divergence time estimation, genomic collinearity analysis, and genome-wide replication analysis.
  • a k-mer analysis can be performed to estimate the genome size. Specifically, a sequence of k bases is iteratively selected from a continuous sequence. If the length of Reads is L and the length of k-mer is k, L-k+1 k-mers can be obtained. For example, k can be taken as 17 for analysis.
  • the analysis results are shown in Fig. 7: the abscissa is the depth, and the ordinate is the ratio of the k-mer species at all depths to all k-mer species, ie the frequency.
  • Genome Size k-mer_num/Peak_depth
  • the distribution of GC bases or AT bases in the genome sequence of each species exhibits different characteristics, and different GC content may affect sequencing randomness.
  • GC content analysis was performed.
  • the abscissa of Figure 8 is the GC content and the ordinate is the average sequencing depth. The calculation is performed without repeating at 10kp. According to this figure, the GC content of this species can be analyzed, and whether the sample has foreign DNA contamination can be judged.
  • the abscissa is the GC content
  • the ordinate is the proportion of the number of windows under the GC.
  • the assembled genomic sequence was calculated to have a GC content of 500 bp, with a 250 bp overlap between the windows.
  • Figure 9 shows the genomic GC content distribution analysis of four species (Panda, Dog, Human, and Mouse) to compare the differences in GC content distribution between assembled and related species. In general, the GC distribution of related species has a similar trend. Through this map, the genomic sequence differences of the two species can be initially determined.
  • base depth analysis is performed in order to evaluate single base coverage depth and reflect base accuracy.
  • the abscissa is the depth
  • the ordinate is the ratio of the number of bases at that depth to the number of all bases.
  • the assembled or obtained BAC or Fosmid clone sequence is used as a reference, and the assembled genomic sequences are aligned back to the reference sequence (blast e 1e-5), and the assembly is checked. Coverage of known sequences to known sequences (requires coverage of 95% for data from the same individual). This evaluation can judge the integrity of the assembled genome from a macroscopic point of view, and can also check the misassembly of the genome sequence from a microscopic point of view.
  • the published or obtained EST or transcriptome sequence is mapped as a query sequence to the assembled genomic sequence, and the coverage of the known sequence for the known sequence is examined (from the same Species' EST or transcriptome data typically require coverage of 98%).
  • the gene family is first composed of a set of genes from an ancestral gene.
  • the identification of gene families is an important aspect of evolutionary analysis. Single-copy gene families and multi-copy gene families can be obtained by clustering homologous genes and identification analysis of gene families. These families are relatively conservative between species and can be used for the analysis of genetic relationships between species. Genes and families specific to the species can also be obtained, which may be related to the specific phenotype of the species. Second, in the specific operation, in addition to the current sequencing species, a total of 6 to 9 species were selected for subsequent analysis.
  • the gene set is first filtered, ie, genes with no initial, no termination, early termination, and cds sequences other than a multiple of 3 are filtered out, and each gene selects the longest transcript for analysis.
  • the alignment program BlastP was used for comparison, and the gene family was clustered by the TreeFam software and OrthoMCL software.
  • Unclustered genes The number of genes unique to the species (including specific genes not clustered by the species); Unique familys: gene families unique to the species (these gene families only contain genes for the species).
  • FIG. 11 wherein human, dog, cow, mouse and rat represent a placental mammal, wherein A in Fig. 11 is a schematic diagram in which most genes are orthologous genes; B is an orthologous gene family. Schematic diagram.
  • a species development tree is first constructed with each single copy gene family.
  • the rate of variation of different species is related to the individual size or generational cycle of the species. Larger individuals and longer generation cycles generally have slower rates of variation.
  • phylogenetic trees are constructed from a single copy gene family. According to the gene family clustering results, a single-copy gene was selected, and after protein-based muscle alignment, it was converted into a nucleotide sequence and ligated to form a supergene. Based on this, phylogenetic tree was constructed by PhyML or MrBayes reconstruction.
  • Figure 12 is a schematic diagram showing the construction of a phylogenetic tree using a quadruple attachment site of an orthologous gene, wherein each branch length represents a neutral evolution rate; the number on the branch represents dN/dS.
  • a quadruple degenerate site in each single copy gene family is typically used to estimate the molecular clock (replacement rate) and the time between species, where the neutral replacement rate
  • the number of variances per site per year is measured.
  • the query finds the calibration time between any two species. If the fossil calibration time can be found in the literature, the final species divergence time estimation result is credible; if the estimated results of existing literature and website can only be found on the timetree.org website as the calibration time, the final result is for reference only; If the calibration time is not available, the divergence time estimation analysis cannot be performed.
  • the calibration time was added to the phylogenetic tree of the species. According to the supergene sequence, the species divergence time was estimated by the BRMC method using mcmctree or multidivtime software.
  • 2.1e-9, 1.3e-9, 1.8e-9, and 2.6e-9 represent the replacement rate in units of each site per year; 97, 61 (15-176), 14 (3-53) ) indicates the calculated age of differentiation, the unit is millions of years.
  • the age of differentiation between humans and dogs comes to TimeTree database (http://www.timetree.org), which is used as the time for correction.
  • the collinear fragment refers to a large fragment that is produced within the same species or between two species due to replication (genomic replication, chromosomal replication or large fragment replication) or species differentiation. Homology. Within this homologous segment, genes are conserved both functionally and in order. The genes in the collinear fragment maintain a high degree of conservation during species evolution.
  • the collinearity analysis visualizes the massive digital mutual information in the form of graphs. By comparing the two sets of sequencing data, we find the collinear relationship between them. In specific operations, first determine one or two closely related species that require genomic collinearity analysis. For animals, a lastz alignment between the two species is performed.
  • first large fragment replication is a low copy region of the genome that is 1-200 kb in length.
  • these replication regions contain genetic structures such as introns and exons, and therefore, these replication regions cannot be obtained a priori relative to common genomic sequences.
  • Most of the large fragment replication regions that have been reported have been obtained experimentally.
  • Analysis of the human genome revealed a large number of large fragment replication regions and tandem replication regions in the human genome. The data suggest that large fragmentation may play an important role in genome and gene evolution.
  • Whole genome alignment results are an important basis for comparative genomic analysis and are generally used to identify functional elements in the genome.
  • WGAC Animal Large Fragmentation Study
  • the Plant Large Fragment Replication Study performs its own BLASTP alignment with the protein sequences of the currently sequenced species, selects homologous pairs, and uses MCscan to find homology blocks, as shown in Figure 16 for collinear blocks.
  • Plant Whole Genome Replication first selects one or two species based on evolutionary relationships between species. All proteins are put together and BLASTP is aligned. Within the species, homologous genes are selected; between the species, orthologous genes are selected. According to the results of MCscan, the two homologous genes/orthologs were subjected to muscle alignment, converted to nucleotides, 4D sites were extracted, and 4Dtv values were calculated. Take the 4Dtv value of each collinear fragment as shown in Figure 20.
  • WGS sequencing libraries of 170 bp, 500 bp, 800 bp, 2k, 5k, 10k, 20k and 40 kb can be constructed in the present invention, and it is promised to add a 40 kb large fragment library unique to Huada Science for assembly.
  • Sequencing depth increased sequencing depth, Illumina Hiseq high-throughput sequencing, sequencing depth of more than 200 times, hybrid genome solution using genome-wide shotgun (WGS) for sequencing.
  • WGS genome-wide shotgun
  • a detailed genome sequencing strategy was developed, and the overall sequencing depth was more than 200 times to ensure the accuracy of single base and the integrity of the genome.
  • a library of gradient inserts from 170 bp, 500 bp, 800 bp to 2 kb, 10 kb, 20 kb, and 40 kb. Sequencing of both ends of different insert libraries will span more repeats of different sizes and similarities.
  • the quality of a single base is greatly improved by repeating the sequencing of a single base at a high depth (greater than 200 times coverage).
  • Soapdenovo2 adds Contig feature recognition function, special treatment for heterozygous, repeat and other sections, and the connection efficiency is higher.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Organic Chemistry (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

本发明公开了一种杂合基因组处理方法,其包括采用全基因组鸟枪法对目标物种进行至少200倍的超高深度测序,以获得有效的读长短序列Reads;对所述有效的Reads执行组装和构建支架序列Scaffold,以获得带有冗余序列的基因组图谱;对所述带有冗余序列的基因组图谱执行杂合识别处理,从而去除杂合区域中冗余的Scaffold并保留杂合区域信息以获得全基因组图谱。

Description

杂合基因组处理方法 技术领域
本发明涉及基因工程技术领域和基因组生物信息学分析领域。特别地,本发明涉及杂合基因组处理,以获得物种的全基因组序列图谱。
背景技术
全基因组de novo测序也叫从头测序,不需要任何参考序列即可对某个物种进行测序,用生物信息学分析方法进行拼接、组装,从而获得该物种的全基因组序列图谱。获得一个物种的全基因组序列是加快对此物种了解的重要捷径。
随着人类基因组计划启动,世界迈出了基因组测序的重要一步;各种模式的动植物线虫、拟南芥、小鼠等纷纷进行了基因组测序;由于技术水平限制,基因组测序还处于初级阶段,在一些重要物种测序完成后,第一代基因组测序也慢慢走向成熟,伴随而来的是利用第一代测序技术对各种重要物种基因组的测序热潮,各种重要动植物基因组相继被测序。不论这些数字如何令人印象深刻,第一代测序技术在速度和成本方面都已达到了极限,因此,需要开发全新的技术来突破这些局限。另外,第一代测序技术也遇到了一个很大的瓶颈,那就是基因组杂合的问题。基因组杂合的情况广泛存在于具有重要经济或科研价值的物种中,而传统的杂合基因组图谱绘制方案成本高、周期长,严重制约了这些物种基因组的解读。
发明内容
本发明要解决的技术问题在于,针对现有技术中传统的杂合基因组图谱绘制方案成本高、周期长,严重制约了物种基因组的解读的缺陷,提供一种杂合基因组处理方法。
本发明解决其技术问题所采用的技术方案是:构造一种杂合基因组处理方法,其包括:
采用全基因组鸟枪法对目标物种进行至少200倍的超高深度测序,以获得有效的读长短序列Reads;
对所述有效的Reads执行组装和构建支架序列Scaffold,以获得带有冗余序列的基因组图谱;
对所述带有冗余序列的基因组图谱执行杂合识别处理,从而去除杂合区域中冗余的Scaffold并保留杂合区域信息以获得全基因组图谱。
在本发明的杂合基因组处理方法中,所述杂合识别处理包括基于k-mer分布图来识别所述杂合区域。
在本发明的杂合基因组处理方法中,所述基于k-mer分布图来识别所述杂合区域包括:
通过k-mer分布图,识别所述带有冗余序列的基因组图谱中的Unique k-mer区域(唯 一的k-mer区域),其中,所述Unique k-mer表示在基因组中只出现一次的k-mer,k-mer表示定长为指定值k的短序列,k为正整数;
其中,所述Unique k-mer表示在基因组中只出现一次的k-mer,k-mer表示定长为指定值k的短序列,k为正整数;
根据所述Unique k-mer区域来确定所述杂合区域。
在本发明的杂合基因组处理方法中,所述根据所述Unique k-mer区域来确定所述杂合区域包括:
统计每条Scaffold上的Unique k-mer的数量N以及那些在已统计过的一条Scaffold上出现过的Unique k-mer的数量M,其中N和M均为正整数;
如果M/N的值达到预定值,则将这两条Scaffold确定为对应的杂合区域。
在本发明的杂合基因组处理方法中,至少基于组装的长度来去除杂合区域中冗余的Scaffold。
在本发明的杂合基因组处理方法中,在构建支架序列Scaffold之前,还包括构建重叠群序列Contig。
在本发明的杂合基因组处理方法中,还包括对所述全基因组图谱进行基因组注释和/或基因组进化分析。
在本发明的杂合基因组处理方法中,所述基因组注释包括Repeat注释、基因预测、基因功能注释和ncRNA注释;所述基因组进化分析包括基因家族鉴定、物种系统发育树构建、物种分歧时间估算、基因组共线性分析和全基因组复制分析。
在本发明的杂合基因组处理方法中,采用全基因组鸟枪法对目标物种进行至少200倍的超高深度测序,以获得有效的读长短序列Reads包括:
构建文库;
采用全基因组鸟枪法对所构建的文库进行至少200倍的超高深度测序,以获得原始的Reads;
对所述原始的Reads进行过滤,以获得有效的Reads。
在本发明的杂合基因组处理方法中,构建文库包括构建小片段文库和构建大片段文库,其中,小片段文库包括长度不同的多个插入片段文库,大片段文库包括长度不同的多个配对Mate Pair文库。
在本发明的杂合基因组处理方法中,构建小片段文库包括:
将检测合格的目标物种的DNA样品打断,以获得DNA片段;
对所述DNA片段进行末端修复;
将末端修复的DNA片段的3’端连接碱基A;
对连接碱基A的DNA片段连接上接头Adapter;
根据目的片段的长度,对连接上接头Adapter的DNA片段执行PCR反应和PCR扩增,以构建小片段文库。
在本发明的杂合基因组处理方法中,构建大片段文库包括:
将目标物种的基因组DNA打断,以获得DNA片段;
对所述DNA片段进行末端修复和生物素标记;
对经末端修复和生物素标记的DNA片段进行片段选择,以获得目的DNA片段;
对目的DNA片段进行环化并消化线性DNA;
将环化的目的DNA片段打断成预定长度的片段,并捕获带有生物素标记的片段;
对捕获的片段进行末端修饰和连接上特定的接头Adapter,以构建大片段文库。
实施本发明的技术方案,本发明不需要任何参考序列下,基于新一代高通量测序技术,通过对目标物种进行超高深度测序(至少200倍);不需要构建BAC文库,成本大大降低;本发明的技术方案只需要6个月,而现有的技术方案需要1年,因此周期大大缩短;将极大推动杂合基因组研究的进展,为快速获得物种的高质量基因组图谱提供了强有力保障。
附图说明
下面将结合附图及实施例对本发明作进一步说明,附图中:
图1是根据本发明的实施例的杂合基因组处理方法的流程示意图;
图2是根据本发明的实施例的杂合识别处理的示意图;
图3是根据本发明的另一实施例的杂合基因组处理方法的示意图;
图4是根据本发明的实施例的构建小片段文库的流程示意图;
图5是根据本发明的实施例的构建大片段文库的流程示意图;
图6是根据本发明的实施例的杂合基因组处理方法中信息分析流程的示意图;
图7是根据本发明的实施例的17-mer深度频率分布图;
图8是根据本发明的实施例的GC含量与测序深度关系分布图;
图9是根据本发明的实施例的四种物种的基因组GC含量分布的示意图;
图10是根据本发明的实施例的测序深度分布图;
图11是根据本发明的实施例的多个物种的蛋白家族的同源性的示意图;
图12是根据本发明的实施例的系统发育树的示意图;
图13是根据本发明的实施例的分化时间和替换速率的估算的示意图;
图14是根据本发明的实施例的大熊猫、狗和人三个基因组的保守序列长度关系的示意图;
图15是根据本发明的实施例的两个物种的基因组共线性的示意图;
图16是根据本发明的实施例的两个物种的共线性示意图;
图17是根据本发明的实施例的大片段复制长度、相似性的分布图;
图18是根据本发明的实施例的基因组自身共线性比对分析结果的示意图;
图19是根据本发明的实施例的基因组自身共线性比对分析结果的另一示意图;
图20是根据本发明的实施例的两个物种的4DTv分布图。
具体实施方式
本发明的目的、特征和特性以及结构的有关元件的操作方法和功能和部件的组合在参照附图阅读下列描述和以上权利要求时将变得更明显。要明确地理解附图仅仅是为了说明和描述的目的并且不意在作为本发明的限制。
在本发明的杂合基因组处理方法中,不需要任何参考序列下,基于新一代高通量测序技术,通过对目标物种进行超高深度测序(一般在200倍以上),然后采用新版SOAPdenovo2组装软件,结合杂合识别处理,从而获得目标物种的全基因组序列图谱。本发明的实施例可以包括全基因组鸟枪法(WGS)测序、基因组组装、基因组注释和进化分析,以及个性化分析,旨在解决对于杂合基因组的难题。
如图1所示,在本发明的实施例中,杂合基因组处理方法包括:
S101:采用全基因组鸟枪法对目标物种进行至少200倍的超高深度测序,以获得有效的读长短序列Reads;
S102:对所述有效的Reads执行组装和构建支架序列Scaffold,以获得带有冗余序列的基因组图谱;
S103:对所述带有冗余序列的基因组图谱执行杂合识别处理,从而去除杂合区域中冗余的Scaffold并保留杂合区域信息以获得全基因组图谱。在特定实施例中,如图2所示,杂合识别处理过程可以包括:通过k-mer(其表示定长为指定值k的短序列)分布图确定Unique k-mer,即在基因组中只出现一次的k-mer,其中,Unique k-mer来自序列中的Unique区域,其代表着独一无二的基因信息,对拼接有着重要意义,而Repeat k-mer(重复k-mer)可能来自序列中的重复区域,这些重复区域可能带来潜在的错拼。接着,将WGS测序的有效的Reads打断成k-mer,统计k-mer的频率并绘制曲线(在一个实施例中,k-mer曲线绘制的时候可以采用17-mer),利用k-mer分布查找基因组中的Unique k-mer。从概率上来讲,>99%的Unique k-mer分布在1.5倍主峰之前的区域,所以把此区域内的k-mer定义为Unique k-mer(如图2所示矩形框内部为Unique K-mer);将Scaffold序列从大到小排列,然后将Unique k-mer比对回Scaffold序列,Unique k-mer富集的区域称为Unique区域,例如图2中的A、A’。而图2中B或C仅为示例基因组存在重复序列等的其他类型序列(repeat k-mer较多的区域),并不是通过序列比对得到这些区域信息;统计每条Scaffold上Unique  k-mer的数量N,以及那些在已统计过的Scaffold上出现过的Unique k-mer的数量M,如果M/N的值达到预定值(在此,该预定值可根据实际实现和需要而灵活设置,为此不进行特别限定),说明这条序列上的Unique区域与已统计过的其他较长Scaffold的Unique区域存在杂合现象。如果两条Scaffold被确定为对应杂合区域,将至少基于组装长度,将冗余的Scaffold去除,例如可以将一条组装质量较差,组装长度的短的Scaffold视为冗余,从基因组图谱中去除,保留在另一份文件中,以最终组装成完整高质量基因组图谱,同时保留杂合区域信息,以便后续进行基因组注释和基因组进化分析,以及个性化分析。在一个实施例中,基因组注释包括Repeat注释、基因预测、基因功能注释和ncRNA注释;基因组进化分析包括基因家族鉴定、物种系统发育树构建、物种分歧时间估算、基因组共线性分析和全基因组复制分析。由此,本发明的技术方案与WGS+BAC to BAC方案相比不需要构建BAC文库,成本大大降低,以及只需要6个月,而WGS+BAC to BAC方案需要1年。
另外,在一实施例中,在构建支架序列Scaffold之前,还包括构建重叠群序列Contig。如图3所示,根据本发明的另一实施例的杂合基因组处理方法包括获取目标物种的基因组DNA、针对该基因组DNA构建WGS测序文库、进行至少200倍超高深度测序、构建Contig、构建Scaffold以获得带冗余序列的基因组图谱、识别Unique k-mer、杂合识别处理以获得全基因组图谱。
在一个实施例中,采用全基因组鸟枪法对目标物种进行至少200倍的超高深度测序,以获得有效的读长短序列Reads包括:构建文库;采用全基因组鸟枪法对所构建的文库进行至少200倍的超高深度测序,以获得原始的Reads;对所述原始的Reads进行过滤,以获得有效的Reads。
在一个实施例中,构建文库包括构建小片段文库和构建大片段文库,其中,小片段文库包括长度不同的多个插入片段文库,大片段文库包括长度不同的多个Mate Pair文库。在本发明的实施例中,小片段数据指插入片段长度小于1kb,大片段数据指插入片段长度大于1kb。
在特定的实施例中,构建170bp、500bp、800bp、2k、5k、10k、20k和40kb共8种WGS测序文库。根据基因组大小和重复序列的具体特点,制定详细的基因组测序策略,整体测序深度在200倍以上,以保证单碱基的准确性和基因组的完整性。小片段文库可包括170bp、500bp、800bp插入片段文库;大片段文库可包括2kb、5kb、10kb、20kb、40kb配对(Mate Pair)文库。
如图4所示,根据本发明的实施例的构建小片段文库包括:
S401:将检测合格的目标物种的DNA样品打断(即对DNA样品进行片段化),以获得DNA片段;其中,可以利用Nanodrop、Gel-Electrophotometric等对目标物种的DNA样 品进行检测,经检测合格的DNA样品用于文库制备。另外,可以使用超声打断DNA样品。
S402:对DNA片段进行末端修复。在一个实施例中,可以使用T4DNA聚合酶和Klenow聚合酶修复末端,制备平末端。
S403:将末端修复的DNA片段的3’端连接碱基A。在一个实施例中,在聚合酶体系的作用下,使上一步得到的修复产物在3’端加上碱基A,为下一步的接头Adapter连接做准备。
S404:对连接碱基A的DNA片段连接上接头Adapte。在一个实施例中,配置T4DNA连接酶反应体系,在Thermomixer中适温反应一定时间使adapter和加“A”产物连接。
S405:DNA片段大小选择。在一个实施例中,根据目的片段大小的不同制备相应的琼脂糖凝胶,对上一步的连接产物进行PCR反应。
S406:PCR扩增及产物纯化。在一个实施例中,配置PCR反应体系对上一步的切胶回收产物进行扩增,其中PCR酶选用Phusion DNA Polymerase,琼脂糖凝胶纯化,从而完成小片段文库构建。
如图5所示,根据本发明的构建大片段文库包括:
S501:将目标物种的基因组DNA随机打断到特定大小,以得到DNA片段。例如,在一个实施例中,可选择2-10kb的范围。
S502:对DNA片段进行末端修复和生物素标记。在一个实施例中,T4DNA聚合酶可将3’突出末端平滑化,将5’突出末端补平。同时通过置换合成法标记探针,T4PNK可将DNA5’端磷酸化,以便进行连接反应,并进行DNA末端标记,用T4多聚核酸激酶除去3’磷酸基团。Klenow DNA聚合酶可补平5’突出末端,形成平末端;切除3’突出末端,形成平末端。利用这三种酶的作用,可达到DNA片段修复和生物素标记的目的。
S503:DNA片段大小选择,以获得目的DNA片段。
S504:对目的DNA片段进行环化。在一个实施例中,取适量回收后的DNA进行环化的酶反应;
S505:线形DNA的消化,其中,将环化后的DNA样品加入消化反应体系中,消化线性DNA。
S506:将环化的目的DNA片段打断成预定长度的片段,并捕获带有生物素标记的片段。在一个实施例中,破碎环化的DNA并纯化:把环化后的DNA分子打断成400-600bp的片段,并通过带有链亲和霉素的磁珠把那些带有生物素标记的片段捕获。
S507:对捕获的片段进行末端修饰和连接上特定的接头Adapter,以构建大片段文库。
在一个实施例中,对小片段文库和大片段文库的整体测序深度均至少为200倍。小片段文库测序策略为101PE;大片段文库测序策略为50PE或91PE。采取Illumina Hiseq2000 测序,可得到很高的基因组单碱基测序深度和基因组物理覆盖度,从而带来高的单碱基准确率,并确保基因组的完整性。
然而,测序得到的原始Reads,并不都是有效的,里面含有带接头的,重复的,测序质量很低的Reads,这些Reads会影响组装和后续分析,必须对原始的Reads过滤,得到有效Reads。过滤的Reads方法如下:
a)截去5’端和3’端碱基GC分叉严重的部分;
b)去除N碱基(含量超过2%)或者是poly A的Reads;
c)对于小片段数据去除低质量(质量值<=7)碱基数目达到一定程度的Reads(取不大于40%);对于大片段数据去除低质量碱基数目达到一定程度的Reads(取不大于60%);。
d)去除有Adapter污染的Reads(与Adapter序列至少10bp比对上,且失配数不多于3个。
e)去除Read1和Read2有重叠的Reads(Read1和Read2重叠至少10bp,且失配低于10%。对测通的Reads不进行此操作。
f)去除重复的Reads(Read1和Read2完全一样才算是重复)。
利用k-mer进行数据纠错,对于一个长度为L的序列可以得到(L-k+1)个长度为k的序列。在数据量足够的情况下,从Reads中逐碱基取出的所有k-mer能够遍历整个基因组,并且k-mer出现的频数是服从泊松分布的。测序错误会导致新的k-mer出现,一般来说这些新的k-mer频数都是比较低的。当测序量足够大的情况下,可以认为频数低的k-mer是因为测序错误导致的。因为大片段在建库的时候经过的环化的过程,而且在组装过程中只起到定位的作用,所以纠错仅针对小片段数据进行。先利用过滤后数据建立k-mer频数表,设置切割将k-mer分为高频的和低频的。对于有低频k-mer出现的Reads,通过改变某些碱基可以使得整个Reads上的k-mer都为高频,那么就认为纠正了一些测序导致的错误。完成基因组组装后,为了评价单碱基覆盖深度并反映碱基准确性,进行碱基深度分析。采用比对程序SOAPaligner将过滤之后的Reads比对回拼接的基因组序列上,然后根据比对结果统计每个碱基被覆盖的次数,从而可以得到各种测序深度的碱基占全基因组的百分比。华大基因自主研发了针对于Illumina Hiseq2000技术特点的短序列组装分析软件(SOAP软件),并已成功应用于黄瓜、熊猫、白菜、蚂蚁、马铃薯、谷子、牦牛、牡蛎等上百个基因组的组装,积累了丰富的基因组组装经验,能高效准确地对Hiseq2000测序得到的海量数据进行拼接组装。目前Soapdenovo已经升级到Soapdenovo2,较之前的版本有多项功能改进,组装效果更好。在Contig构建中,Soapdenovo2具有可选的多k-mer的功能,多k-mer是soapdenovo2的新特性。可以在Contig构建中使用梯度k-mer,较之前版本的单一k-mer组装,效果更好。
如图6所示,在根据本发明的实施例的杂合基因组处理方法中信息分析流程中涉及对大片段文库和小片段文库进行测序所得的原始Reads进行数据过滤、基因组组装、基因组注释和基因组进化分析。基因组组装涉及采用SOAPdenovo2组装软件进行组装、GC-深度分析、GC含量分析、测序深度分析、基因组组装结果评估(常染色体区域覆盖度评估和基因区覆盖度评估)。基因组注释涉及Repeat注释、基因预测、基因功能注释和ncRNA注释。基因组进化分析涉及基因家族鉴定、物种系统发育树构建、物种分歧时间估算、基因组共线性分析和全基因组复制分析。
在一个实施例中,如图7所示,对于GC-深度分析,可执行k-mer分析估计基因组大小。具体地,从一段连续序列中迭代地选取长度为k个碱基的序列,若Reads长度为L,k-mer长度为k,那么可以得到L-k+1个k-mer。例如,可取k为17来进行分析。分析结果如图7:图中横坐标为深度,纵坐标为各深度下的k-mer种类占所有k-mer种类的比例,即频率。
使用高质量测序数据,逐碱基取17-mer获得深度频数分布图,如图7示出深度峰值的位置,因此,根据公式Genome Size=k-mer_num/Peak_depth,可以估算出该物种的基因组大小。本发明的技术方案与现有技术方案最大的区别是k-mer大小的设置和一些参数的调节:在组装的时候用的是大k-mer进行组装,一般是取k=61以上,而现有技术大多取的是k=30多或40多。大k-mer能跨过更长的重复序列和更多的杂合位点,但是它对数据量的要求非常高,以61-mer为例,它至少需要150-160倍以上的测序深度,而63-mer则要求160-180倍测序深度。k值取的越大,要求的测序量越大。
在一个实施例中,如图8所示,对于GC含量分析,各个物种基因组序列中GC碱基或AT碱基的分布呈现不同特征,而且不同的GC含量可能影响测序随机性。为了分析该基因组序列的碱基分布特征和测序随机性,进行GC含量分析。图8横坐标是GC含量,纵坐标是平均测序深度。以10kp为窗口无重复进行计算。根据这个图可以分析这个物种的GC含量,可以对该样品是否有外源DNA污染进行判断。
如图9所示横坐标为GC含量,纵坐标为该GC下的窗口数目所占比例。将组装的基因组序列以500bp为窗口计算GC含量,窗口之间有250bp重叠。如图9示出四种物种(Panda、Dog、Human和Mouse)的基因组GC含量分布分析可以比较组装基因组与近缘物种的GC含量分布存在的差异。一般情况下,近缘物种的GC分布有相似的趋势,通过这个分布图,可以初步判断两个物种的基因组序列差异。
在一个实施例中,如图10所示,对于测序深度分析,完成基因组组装后,为了评价单碱基覆盖深度并反映碱基准确性,进行碱基深度分析。如图10横坐标为深度,纵坐标为该深度的碱基数目占所有碱基数目的比例。采用比对程序SOAPaligner将过滤之后的Reads 比对回拼接的基因组序列上,然后根据比对结果统计每个碱基被覆盖的次数,从而可以得到各种测序深度的碱基占全基因组的百分比。
在一个实施例中,对于常染色体区域覆盖度评估,把已经公布或者获得的BAC或Fosmid克隆序列作为参考,将组装完成的基因组序列比对回参考序列上(blast e 1e-5),检查组装的序列对已知序列的覆盖度(对于来自于同一个体的数据,要求覆盖度达到95%水平)。这个评价即可以从宏观的角度判断组装基因组的完整度,还可以从微观的角度去检查基因组序列的错误组装。
在一个实施例中,对于基因区覆盖度评估,把已公布或者获得的EST或转录组序列作为query序列映射到组装完成的基因组序列上,检查组装序列对已知序列的覆盖度(来自于同一物种的EST或转录组数据一般要求覆盖度达到98%水平)。
在一个实施例中,对于基因聚类分析,首先基因家族是由来至一个祖先基因的一组基因组成。基因家族的鉴定,是进化分析很重要的一个方面。通过同源基因的聚类及基因家族的鉴定分析,可以得到单拷贝基因家族和多拷贝基因家族。这些家族在物种之间都是比较保守的,可用于物种间亲缘关系的分析。还可以得到物种特有的基因和家族,它们可能和物种的特异性表型有关。其次,在进行具体操作时,除当前测序物种外,共选取6到9个物种进行后续分析。首先过滤基因集,即过滤掉无起始、无终止、提前终止、cds序列不是3的倍数的基因,每一个基因选取最长的转录本进行分析。对所有的蛋白质,用比对程序BlastP做比对,对动植物分别用TreeFam软件和OrthoMCL软件进行基因家族聚类分析。
注:Unclustered genes:该物种特有的基因的个数(包含该物种未聚类的特异基因);Unique familys:该物种特有的基因家族(这些基因家族只包含该物种的基因)。
如图11所示,其中人、狗、牛、小鼠和大鼠代表胎盘哺乳动物,其中,图11中的A是大多数的基因是直系同源基因的示意图;B是直系同源基因家族的示意图。
在一个实施例中,对于系统发育分析,首先用每单拷贝基因家族构建物种发育树。不同物种的变异速率和该物种的个体大小或者世代周期有关,较大的个体、较长的世代周期一般有较慢的变异速率。其次,在具体操作时,系统发育树是根据单拷贝基因家族来构建的。根据基因家族聚类结果,选取单拷贝基因,利用基于蛋白序列的muscle比对后,转换为核苷酸序列,连接形成超基因,在此基础上利用PhyML或MrBayes重构构建系统发育树。
如图12所示为利用直系同源基因的四重兼并位点构建系统发育树的示意图,其中,每个分支长度代表中性进化速率;树枝上的数字代表dN/dS。
在一个实施例中,对于物种分歧时间估算,首先每个单拷贝基因家族中的四重简并位点,通常用以估算分子钟(替换速率)以及物种间的分化时间,其中中性替换速率一般用 每个位点每年的变异数来衡量。其次在具体操作时,需要确定标定时间:根据系统发育树分析涉及的物种,查询找到任意两个物种之间的标定时间。如果能在文献中找到化石标定时间,则最终物种分歧时间估算结果可信;如果只能在timetree.org网站找到已有文献、网站的估算结果作为标定时间,则最终结果仅供参考;如果找不到标定时间,则无法进行分歧时间估算分析。将标定时间加到物种系统发育树中,据超基因序列,利用mcmctree或multidivtime软件,以BRMC方法估算物种分歧时间。
如图13所示2.1e-9、1.3e-9、1.8e-9和2.6e-9代表替换速率,单位是每个位点每年;97、61(15-176)、14(3-53)表示算出来的分化年代,单位是百万年。人和狗的分化年代来至TimeTree database(http://www.timetree.org),用来作为校正的时间。
在一个实施例中,对于基因组共线性分析,首先共线性片段指同一个物种内部或者两个物种之间,由于复制(基因组复制、染色体复制或者大片段复制)或者物种分化而产生的大片段的同源性现象。在该同源片段内部,基因在功能上以及排列顺序上都是保守的。共线性片段中的基因在物种进化过程中保持了高度的保守性。共线性分析将海量数字相互信息用图的形式可视化的表现出来,对两组测序数据通过比对分析,发现他们之间存在的共线性关系。在具体操作时,先确定1到2个需要进行基因组共线性分析的近缘物种。对于动物,进行两个物种之间的lastz比对。如果选择了2个近缘物种,还可以接着用multiz进行多物种的序列比对,如果是3个物种的基因组比对,还需要做出图13所示的分化时间和替换速率的估算。对于植物,首先进行两个物种之间的BLASTP比对。选取同源成对(相互blastp的最佳比对结果),再用MCscan寻找物种间同源blocks,选取比对较好的blocks进行作图分析物种间的共线性如图14所示。
如图15所示,图中的点和线表示局部的比对结果。若有连好的染色体,此处可以使用Circos画图,如图15所示。
在一个实施例中,如图16-20所示,对于全基因组复制分析,首先大片段复制是基因组中长度为1-200kb的低拷贝区域。通常境况下这些复制区域包含内含子和外显子等基因结构,因此,相对于普通的基因组序列,这些复制区域并不能靠先验获取。大部分的已报道过的大片段复制区域都是用实验分析的方法获取的。对人类基因组的分析表明,在人的基因组中有大量散布的大片段复制区域,以及串联复制区域。资料显示大片段复制现象可能在基因组和基因进化方面起着重要的作用。全基因组比对结果是比较基因组分析中的一个重要基础,它一般用于识别基因组中的功能元件。例如,通过基因组的多序列比对结果得到的多个远缘物种的同源序列一般暗示着这些序列是保守的,具有一定的生物特性。在操作时,动物大片段复制研究(WGAC)对当前测序物种基因组序列进行自身lastz比对分析,在比对之前,先屏蔽重复序列,再将比对的序列切分为多个小的文件。最大允许的比对空 隙为100bp。比对之后,过滤出长度大于1kbp,identity(同一性)大于90%的,且去除了重叠部分的区域,这些区域即检测出的大片段复制区域。植物大片段复制研究(WGAC)用当前测序物种的蛋白序列进行自身BLASTP比对,选取同源成对,再用MCscan寻找同源性blocks,根据共线性blocks画图如图16所示。植物全基因组复制(WGD)首先根据物种之间的进化关系,选取1到2个物种。将所有蛋白合在一起,BLASTP比对。物种内部,选取同源基因;物种之间,选取直系同源基因。根据MCscan结果,对两两同源基因/直系同源进行muscle比对,转换为核苷酸,提取4D位点,计算4Dtv值。取每个共线性片段的4Dtv值,如图20所示。
植物基因组自身共线性比对分析结果如图18所示。若有连好的染色体,此处可以使用Ciross画图,如图19所示。
实施本发明的实施例的杂合基因组处理方法,具有以下优势:
测序文库类型增加,在本发明中可构建170bp、500bp、800bp、2k、5k、10k、20k和40kb共8种WGS测序文库,承诺加入了华大科技独有的40kb大片段文库用于组装。
测序深度,测序深度提高,Illumina Hiseq高通量测序,测序深度达到200倍以上,杂合基因组解决方案采用全基因组鸟枪法(WGS)进行测序。根据基因组大小和重复序列的具体特点,制定详细的基因组测序策略,整体测序深度在200倍以上,以保证单碱基的准确性和基因组的完整性。根据组装需要,我们要进行从170bp,500bp,800bp到2kb,10kb,20kb,40kb梯度插入片断的文库的两端测序。不同插入片段文库的两端测序将跨过更多不同大小和相似度的重复序列。通过对单碱基的高深度重复测序(大于200倍覆盖度),使单碱基的质量大大提高。
算法增加:我们会构建不同长度的插入片段文库,并采用双末端测序以提高组装的准确性。Contig构建。我们采用SOAP denovo软件,针对于Illumina Hiseq2000技术特点的短序列组装分析软件(SOAP软件),并已成功应用于黄瓜、熊猫、白菜、蚂蚁、马铃薯、谷子、牦牛、牡蛎等上百个基因组的组装,积累了丰富的基因组组装经验,能高效准确地对Hiseq2000测序得到的海量数据进行拼接组装。目前Soapdenovo已经升级到Soapdenovo2,较之前的版本有多项功能改进,组装效果更好。在Contig构建中,Soapdenovo2具有可选的多k-mer的功能,较之前版本的单一k-mer组装,效果更好。使用SOAPDenovo2进行组装,参数使用如下:
SOAPdenovo-127mer pregraph-s lib_120X.cfg-o tobacco-K 77-p 32-R-d1>pregraph.log
SOAPdenovo-127mer contig-g tobacco-R>contig.log
SOAPdenovo-127mer map-s lib_120X.cfg-g tobacco-f-p 32-k 45>map.log
SOAPdenovo-127mer scaff-g tobacco-N xxx 2>scaff.log
对于Scaffold构建,Soapdenovo2增加Contig特性识别功能,对杂合、重复等区段做特殊处理,连接效率更高。
以上所述仅为本发明的优选实施例而已,并不用于限制本发明,对于本领域的技术人员来说,本发明可以有各种更改和变化。凡在本发明的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本发明的权利要求范围之内。
在本说明书的描述中,参考术语“一个实施例”、“一些实施例”、“示例”、“具体示例”、或“一些示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本发明的至少一个实施例或示例中。在本说明书中,对上述术语的示意性表述不必须针对的是相同的实施例或示例。而且,描述的具体特征、结构、材料或者特点可以在任一个或多个实施例或示例中以合适的方式结合。此外,在不相互矛盾的情况下,本领域的技术人员可以将本说明书中描述的不同实施例或示例以及不同实施例或示例的特征进行结合和组合。

Claims (10)

  1. 一种杂合基因组处理方法,其特征在于,包括:
    采用全基因组鸟枪法对目标物种进行至少200倍的超高深度测序,以获得有效的读长短序列Reads;
    对所述有效的Reads执行组装和构建支架序列Scaffold,以获得带有冗余序列的基因组图谱;
    对所述带有冗余序列的基因组图谱执行杂合识别处理,从而去除杂合区域中冗余的Scaffold并保留杂合区域信息以获得全基因组图谱。
  2. 根据权利要求1所述的杂合基因组处理方法,其特征在于,所述杂合识别处理包括基于k-mer分布图来识别所述杂合区域。
  3. 根据权利要求2所述的杂合基因组处理方法,其特征在于,所述基于k-mer分布图来识别所述杂合区域包括:
    通过k-mer分布图,识别所述带有冗余序列的基因组图谱中的Unique k-mer区域,其中,所述Unique k-mer表示在基因组中只出现一次的k-mer,k-mer表示定长为指定值k的短序列,k为正整数;
    根据所述Unique k-mer区域来确定所述杂合区域。
  4. 根据权利要求3所述的杂合基因组处理方法,其特征在于,所述根据所述Unique k-mer区域来确定所述杂合区域包括:
    统计每条Scaffold上的Unique k-mer的数量N以及那些在已统计过的一条Scaffold上出现过的Unique k-mer的数量M,其中N和M均为正整数;
    如果M/N的值达到预定值,则将这两条Scaffold确定为对应的杂合区域。
  5. 根据权利要求1-4任一所述的杂合基因组处理方法,其特征在于,至少基于组装的长度来去除杂合区域中冗余的Scaffold。
  6. 根据权利要求1-4任一所述的杂合基因组处理方法,其特征在于,在构建支架序列Scaffold之前,还包括构建重叠群序列Contig。
  7. 根据权利要求1-4任一所述的杂合基因组处理方法,其特征在于,还包括对所述全基因组图谱进行基因组注释和/或基因组进化分析。
  8. 根据权利要求7所述的杂合基因组处理方法,其特征在于,所述基因组注释包括Repeat注释、基因预测、基因功能注释和ncRNA注释;所述基因组进化分析包括基因家族鉴定、物种系统发育树构建、物种分歧时间估算、基因组共线性分析和全基因组复制分析。
  9. 根据权利要求1-4任一项所述的杂合基因组处理方法,其特征在于,采用全基因组 鸟枪法对目标物种进行至少200倍的超高深度测序,以获得有效的读长短序列Reads包括:
    构建文库;
    采用全基因组鸟枪法对所构建的文库进行至少200倍的超高深度测序,以获得原始的Reads;
    对所述原始的Reads进行过滤,以获得有效的Reads。
  10. 根据权利要求9所述的杂合基因组处理方法,其特征在于,构建文库包括构建小片段文库和构建大片段文库,其中,小片段文库包括长度不同的多个插入片段文库,大片段文库包括长度不同的多个配对Mate Pair文库。
PCT/CN2015/075871 2014-04-04 2015-04-03 杂合基因组处理方法 WO2015149719A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410137420.4 2014-04-04
CN201410137420.4A CN104164479B (zh) 2014-04-04 2014-04-04 杂合基因组处理方法

Publications (1)

Publication Number Publication Date
WO2015149719A1 true WO2015149719A1 (zh) 2015-10-08

Family

ID=51908460

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/075871 WO2015149719A1 (zh) 2014-04-04 2015-04-03 杂合基因组处理方法

Country Status (2)

Country Link
CN (1) CN104164479B (zh)
WO (1) WO2015149719A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111564182A (zh) * 2020-05-12 2020-08-21 西藏自治区农牧科学院水产科学研究所 一种高重复原鮡属鱼类的染色体级别组装的方法
CN116343919A (zh) * 2023-04-11 2023-06-27 天津大学四川创新研究院 一种全基因组图谱绘制测序方法
CN117418039A (zh) * 2023-12-18 2024-01-19 北京林业大学 一种基于基因组学技术筛选轻木生长-防御候选基因的方法

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104164479B (zh) * 2014-04-04 2017-09-19 深圳华大基因科技服务有限公司 杂合基因组处理方法
CN104573407B (zh) * 2015-02-10 2017-05-24 东南大学 一种物种特异性内源性条形码的搜索方法及其在多样本混合测序中的应用
CN105002570B (zh) * 2015-07-21 2017-09-05 中国农业科学院深圳农业基因组研究所 一种一次制备多个dna大片段插入双末端测序文库的方法
CN105420375B (zh) * 2015-12-24 2020-01-21 北京大学 一种环境微生物基因组草图的构建方法
CN105631242B (zh) * 2015-12-25 2018-09-11 中国农业大学 一种利用全基因组测序数据鉴定转基因事件的方法
CN107038349B (zh) * 2016-02-03 2020-03-31 深圳华大生命科学研究院 确定重排前v/j基因序列的方法和装置
CN107133493B (zh) * 2016-02-26 2020-01-14 中国科学院数学与系统科学研究院 基因组序列的组装方法、结构变异探测方法和相应的系统
WO2017143585A1 (zh) * 2016-02-26 2017-08-31 深圳华大基因研究院 对分隔长片段序列进行组装的方法和装置
CN106022003B (zh) * 2016-05-17 2019-03-29 杭州和壹基因科技有限公司 一种基于三代PacBio测序数据的scaffold构建方法
CN106022002B (zh) * 2016-05-17 2019-03-29 杭州和壹基因科技有限公司 一种基于三代PacBio测序数据的补洞方法
CN106021997B (zh) * 2016-05-17 2019-03-29 杭州和壹基因科技有限公司 一种三代PacBio测序数据的比对方法
CN106021985B (zh) * 2016-05-17 2019-03-29 杭州和壹基因科技有限公司 一种基因组数据压缩方法
CN106951731B (zh) * 2017-03-28 2019-04-12 至本医疗科技(上海)有限公司 一种大片段插入或缺失的预测方法及系统
CN108660197A (zh) * 2017-04-01 2018-10-16 深圳华大基因科技服务有限公司 一种二代序列基因组重叠群的组装方法和系统
CN111028889B (zh) * 2019-12-03 2021-04-20 广西壮族自治区农业科学院 一种获得活体营养型植物病原卵菌无污染基因组的方法
CN111445953B (zh) * 2020-03-27 2022-04-26 武汉古奥基因科技有限公司 一种利用全基因组比对拆分四倍体鱼类亚基因组的方法
CN112786110B (zh) * 2021-01-29 2023-08-15 武汉希望组生物科技有限公司 一种序列组装方法及系统

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103080333A (zh) * 2010-09-14 2013-05-01 深圳华大基因科技有限公司 一种基因组结构性变异检测方法和系统
CN103382471A (zh) * 2012-05-02 2013-11-06 董志平 谷子抗锈基因共分离的分子标记及其检测方法
CN104164479A (zh) * 2014-04-04 2014-11-26 深圳华大基因科技服务有限公司 杂合基因组处理方法

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001063543A2 (en) * 2000-02-22 2001-08-30 Pe Corporation (Ny) Method and system for the assembly of a whole genome using a shot-gun data set
US20090298064A1 (en) * 2008-05-29 2009-12-03 Serafim Batzoglou Genomic Sequencing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103080333A (zh) * 2010-09-14 2013-05-01 深圳华大基因科技有限公司 一种基因组结构性变异检测方法和系统
CN103382471A (zh) * 2012-05-02 2013-11-06 董志平 谷子抗锈基因共分离的分子标记及其检测方法
CN104164479A (zh) * 2014-04-04 2014-11-26 深圳华大基因科技服务有限公司 杂合基因组处理方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KELL EHER, C.T. ET AL.: "A Physical Map of the Highly Heterozygous Populus Genome: Integration with the Genome Sequence and Genetic Map and Analysis of Haplotype Variation", THE PLANT JOURNAL, vol. 50, no. 6, 30 June 2007 (2007-06-30), XP055228911 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111564182A (zh) * 2020-05-12 2020-08-21 西藏自治区农牧科学院水产科学研究所 一种高重复原鮡属鱼类的染色体级别组装的方法
CN111564182B (zh) * 2020-05-12 2024-02-09 西藏自治区农牧科学院水产科学研究所 一种高重复原鮡属鱼类的染色体级别组装的方法
CN116343919A (zh) * 2023-04-11 2023-06-27 天津大学四川创新研究院 一种全基因组图谱绘制测序方法
CN116343919B (zh) * 2023-04-11 2023-12-08 天津大学四川创新研究院 一种全基因组图谱绘制测序方法
CN117418039A (zh) * 2023-12-18 2024-01-19 北京林业大学 一种基于基因组学技术筛选轻木生长-防御候选基因的方法

Also Published As

Publication number Publication date
CN104164479A (zh) 2014-11-26
CN104164479B (zh) 2017-09-19

Similar Documents

Publication Publication Date Title
WO2015149719A1 (zh) 杂合基因组处理方法
JP7284849B2 (ja) 不均一分子長を有するユニーク分子インデックスセットの生成およびエラー補正のための方法およびシステム
Ling et al. Genome sequence of the progenitor of wheat A subgenome Triticum urartu
Zimin et al. Sequencing and assembly of the 22-Gb loblolly pine genome
JP6685324B2 (ja) 特異的分子インデックス(umi)を有する冗長リードを用いたシーケンシングdna断片におけるエラーの抑制
US20210317518A1 (en) Sequencing controls
KR102458022B1 (ko) 혼합물 중 핵산의 서열분석 방법 및 그와 관련된 조성물
US20190276884A1 (en) Methods for assembling and reading nucleic acid sequences from mixed populations
Martin et al. Next-generation transcriptome assembly
Troll et al. A ligation-based single-stranded library preparation method to analyze cell-free DNA and synthetic oligos
JP6017458B2 (ja) 大量並列連続性マッピング
Cullum et al. The next generation: using new sequencing technologies to analyse gene regulation
Johnson et al. Single nucleotide analysis of cytosine methylation by whole‐genome shotgun bisulfite sequencing
Babarinde et al. Computational methods for mapping, assembly and quantification for coding and non-coding transcripts
Scheibye-Alsing et al. Sequence assembly
JP2014518638A (ja) ヌクレオチド配列データの提供
Beier et al. Multiplex sequencing of bacterial artificial chromosomes for assembling complex plant genomes
JP2018509928A (ja) 環状化メイトペアライブラリーおよびショットガン配列決定を用いて、ゲノム変異を検出するための方法
Monger et al. Towards next generation CHO cell biology: Bioinformatics methods for RNA‐Seq‐based expression profiling
Goswami et al. RNA-Seq for revealing the function of the transcriptome
WO2013152505A1 (zh) 一种转录组组装的方法及系统
US20230416812A1 (en) Method capable of making one cluster by connecting information of strands generated during pcr process and tracking generation order of generated strands
Lu et al. A high-quality assembled genome and its comparative analysis decode the adaptive molecular mechanism of the number one Chinese cotton variety CRI-12
CN109929947B (zh) 一种基于RNA-Seq开发的小扁豆KASP标记及应用
Stodola et al. Genome-wide map of proximity linkage to renin proximal promoter in rat

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15772305

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15772305

Country of ref document: EP

Kind code of ref document: A1