CN115810395B - T2T assembly method based on high-throughput sequencing animal and plant genome - Google Patents

T2T assembly method based on high-throughput sequencing animal and plant genome Download PDF

Info

Publication number
CN115810395B
CN115810395B CN202211549592.3A CN202211549592A CN115810395B CN 115810395 B CN115810395 B CN 115810395B CN 202211549592 A CN202211549592 A CN 202211549592A CN 115810395 B CN115810395 B CN 115810395B
Authority
CN
China
Prior art keywords
genome
data
assembly
sequencing
ont
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211549592.3A
Other languages
Chinese (zh)
Other versions
CN115810395A (en
Inventor
舒自强
蒋冕
樊鹏宇
杨路路
陈刚
苏晓俊
田朝阳
宋驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Bena Technology Co ltd
Original Assignee
Wuhan Bena Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Bena Technology Co ltd filed Critical Wuhan Bena Technology Co ltd
Priority to CN202211549592.3A priority Critical patent/CN115810395B/en
Publication of CN115810395A publication Critical patent/CN115810395A/en
Application granted granted Critical
Publication of CN115810395B publication Critical patent/CN115810395B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a T2T assembly method based on high-throughput sequencing animal and plant genome, which comprises the following steps: s1, based on sample sequencing, respectively performing ONT ultra-long sequencing data assembly and HiFi sequencing data assembly; s2, removing redundancy of assembly results of the two routes based on a genome size evaluation result; s3, respectively carrying out HiC mounting on the assembly results of the two routes to obtain a chromosome level reference genome gap; s4, performing third-generation error correction on the ONT line genome centromere region, and performing second-generation error correction on the ONT line genome based on second-generation data; s5, comparing terminal fragments based on original data, reassembling according to the abundance of the telomere repeated base sequence, replacing terminal telomere sequences through comparison, and extending telomeres of two lines; s6, respectively supplementing gap and HiFi error correction to the genome sequence; s7, adjusting chromosomes by combining genome sequencing of known species to obtain assembled genomes; the T2T genome telomeres are complete and have no gaps, so that the research progress of animals and plants is promoted.

Description

T2T assembly method based on high-throughput sequencing animal and plant genome
Technical Field
The invention relates to the technical field of gene assembly, in particular to a T2T assembly method based on high-throughput sequencing of animal and plant genome.
Background
High-throughput sequencing technology (High-throughput sequencing) is also known as Next-generation sequencing technology (Next-generation sequencing technology). High-throughput sequencing technology has now come to the fourth generation, nanopore sequencing technology (also known as fourth generation sequencing technology) is a new generation of sequencing technology that has emerged in recent years. The current sequencing length can reach 4Mb. This technology began in the 90 s and underwent three major technological innovations: 1. single molecule DNA passes through the nanopore; 2. the enzyme on the nanopore controls the single nucleotide precision of the sequencing molecule; 3. sequencing accuracy control of single nucleotides. With the continuous development of species genome research, sequencing technology is continuously improved, and the continuity and the integrity of the species genome are also greatly improved. Species genomes range from draft genome (genome 1.0) assembled by common second generation sequencing, to near complete genome (genome 3.0) assembled by ONT (N50 >50Kb or R10) sequencing techniques, to T2tgen genome (genome 4) assembled by binding to ontura-long (oxford Kong Chaochang sequencing techniques) and pacb High definition (High accuracy read length sequencing techniques) sequencing techniques. Animal and plant T2T genome refers to a horizontal genome obtained by combining HiFi and second generation data (high throughput sequencing, such as Illumina HiSeqTM/MiseqTM) through ONTulter-longN 50>100Kb (sequencing read length N50 is more than 1000000, N50: N50 is an evaluation index after genome splicing, all sequences obtained by splicing are sequenced from big to small according to sequence sizes, then accumulation is gradually started, when the added length exceeds half of the total length, the added sequence length is N50 length), one or more obtained chromosomes reach Telomere-to-Telomere (T2T) genome completion map is the ultimate goal of genome assembly.
The requirements for T2T genome assembly include: chromosome assembly is free of gaps, the QV value (genome assembly accuracy evaluation standard) is greater than 40, BUSCO (genome assembly integrity evaluation standard) evaluation is greater than 95%, and telomeres are complete. It has been difficult to achieve T2T levels in conventional single sequencing means, ONT common sequencing or PacBIOHiFi. The common ONT sequencing assembly precision is low, the quality standard with the QV more than 40 is difficult to reach, and the chromosome assembly is difficult to reach without gaps; while high assembly accuracy can be achieved by HiFi sequencing assembly, chromosome assembly is not free from gaps, and centromere regions often have assembly gaps.
Therefore, it is necessary to address the drawbacks of the prior art and propose a method for assembling T2T based on high throughput sequencing of animal and plant genomes.
Disclosure of Invention
The invention aims to provide a T2T assembly method based on high-throughput sequencing of animal and plant genome, which can overcome the defects in the past by combining an ONT ultra-long sequencing technology with a HiFi sequencing technology and a high-standard and high-quality assembly technology, and the genome assembly reaches the T2T level, so that the research progress of the whole species is promoted.
In view of this, the scheme of the invention is as follows:
an animal and plant genome T2T assembly method based on high-throughput sequencing comprises the following steps:
s1, based on sample sequencing, respectively performing ONT ultra-long sequencing data assembly and HiFi sequencing data assembly;
s2, removing redundancy of assembly results of the two routes based on a genome size evaluation result;
s3, respectively carrying out HiC mounting on the assembly results of the two routes to obtain a chromosome level reference genome gap;
s4, performing third-generation error correction on the ONT line genome centromere region, and performing second-generation error correction on the ONT line genome based on second-generation data;
s5, comparing terminal fragments based on original data, reassembling according to the abundance of the telomere repeated base sequence, replacing terminal telomere sequences through comparison, and extending telomeres of two lines;
s6, respectively supplementing gap and HiFi error correction to the genome sequence;
s7, adjusting the chromosomes by combining the genome sequence of the known species to obtain an assembled genome.
In one embodiment of the present invention, the third generation error correction procedure in the step S4 is: iterative rounds of three-generation error correction were performed on the ONT genomic centromere region using the k-mer anchoring method.
As a preferred embodiment, the ONT genomic centromere region is based on the second generation data and the frequency of k-mers in the genome, the markers are 21-kmers that occur once in the assembly and 14 to 46 times in the second generation data.
In one embodiment of the present invention, the second generation error correction process in the step S4 is: and cutting and comparing the second-generation original data to the genome, performing mutation detection through a deep neural network to obtain mutation information, and filtering and aligning the mutation information to obtain a consistency sequence so as to obtain the genome after second-generation error correction.
In one embodiment of the present invention, the step S5 telomere extension process is:
s51, comparing each chromosome with original data respectively, collecting all fragments which are compared once in a fixed length at the tail end of the chromosome, calculating the times of occurrence of telomere repeated base sequences in each fragment, defining the fragment with the largest occurrence times as ref, and reassembling ref and the query to obtain a consistency sequence;
s52, comparing the identical sequences to each chromosome respectively, and replacing the terminal telomere sequences by taking the optimal comparison result.
As a preferred embodiment, in the step S52, if the identity is lower than 80 threshold or the upper region of the alignment is not 20Kbp at the chromosome end, no substitution is performed.
In one embodiment of the present invention, the genome sequence complement step in step S6 is as follows: hole filling is carried out on the genome based on a hole filling program, hole filling data are compared with genome data, and hole filling is respectively carried out according to the sequence of other corrected genome versions, hiFi data and ONT data.
As a preferred embodiment, in the gap filling step, under the condition that the data volume is insufficient, the genome is filled with the original segment of the mutants/ont; if the position on the alignment can just cross the two ends of the gap, the optimal alignment region of the longest length region on the alignment is selected, and the sequence containing the gap region on the genome is replaced by the complementary gap data.
In one embodiment of the present invention, the genome HiFi error correction step in step S6 is: firstly filtering HiFi data lower than 10kbp, and comparing the filtered data with the genome of the hole to obtain a combined and ordered file; and respectively filtering out the fragments subjected to secondary alignment and chimeric alignment, and performing third-generation error correction.
In one embodiment of the invention, the T2T assembly method further comprises a collinearity analysis step of: performing colinear alignment analysis on the adjusted genome in the step S7 and the genome of the published species, wherein analysis indexes comprise integrity and accuracy.
Compared with the prior art, the beneficial effects of the invention include, but are not limited to:
1. the T2T assembly method based on the high-throughput sequencing animal and plant genome can overcome the defects in the past by combining an ONT ultra-long sequencing technology with a HiFi sequencing technology and a high-standard and high-quality assembly technology, the genome assembly reaches the T2T level, and the genome with complete telomeres and no gaps is obtained, so that the research progress of the whole species is promoted.
2. The high-throughput sequencing-based animal and plant genome T2T assembly method provided by the invention is based on a plurality of sequencing means and is assisted by complex open source software and self-research software to achieve the effect of simple assembly beyond a single sequencing means, thereby providing a new choice for genome T2T assembly.
Drawings
FIG. 1 is a flow chart of the method for assembling T2T genome of animals and plants according to the invention.
Fig. 2 is a flowchart showing steps specific to HiC of the present invention.
FIG. 3 is an interaction heat map based on the relationship between the interaction intensity and the position in the embodiment of the invention.
FIG. 4 is a heat map of Hi-C interaction of the final T2T version in an embodiment of the invention.
FIG. 5 is a schematic diagram showing the results of genome colinear analysis obtained by the T2T assembly method in the examples of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantageous technical effects of the present invention more apparent, the present invention will be described in further detail with reference to the following detailed description. It should be understood that the detailed description is intended to illustrate the invention, and not to limit the invention.
In one embodiment, a method for assembling animal and plant genome T2T based on high-throughput sequencing is provided, and the flow is shown in FIG. 1, and the steps are as follows:
1. sequencing: the same animal and plant biological samples need to be subjected to second generation sequencing (illuminea) or BGI (Huada gene), ONT ultra-long sequencing (OxfordNOPOPLOcore technology platform instrument sequencing), hiC sequencing (Hi-C technology is derived from genome capture technology), pacBioHiFi sequencing (PacificBisciences).
2. Evaluation: genome size was estimated by second generation sequencing into a surviviny genome analysis.
3. And (3) primary assembly: in two lines.
3.1, taking an ONT ultra-long sequencing data assembly result as a framework (hereinafter referred to as ONT framework), and respectively using Flye, nextdenovo and Necat three genome assembly software for assembly;
Flye:https://github.com/fenderglass/Flye;
Nextdenovo:https://github.com/Nextomics/NextDenovo;
Necat:https://github.com/xiaochuanle/NECAT。
3.2 assembling with HIFIasm (https:// gitsub.com/chhyp 123/hifiasm) using the HIFIasm sequencing data as a scaffold (hereinafter referred to as HIFI scaffold).
4. Redundancy elimination and evaluation
The assembled genome size exceeds the surviviny estimated size, and in order to reduce the genome to the estimated size, the assembly results of the two routes are required to be subjected to redundancy elimination; QV value calculation (post-assembly genome accuracy assessment value) was performed using merqury (https:// gilsub.com/searchq = merqury); post-assembly genome integrity assessment was performed using BUSCO (https:// BUSCO. Ezlab. Org /); back-aligned to the assembled genome using reads; removing low mass contigs (assembled sequences); the subject species mitochondria, chloroplasts were downloaded at NCBI and the assembled genome was aligned using minimap2 (https:// gitsub. Com/lh3/minimap 2), and contigs on the alignment were removed.
HiC mount
5.1ONT backbone route: selecting the best data in the three assembly results for HiC mounting (mounting is an application for guiding two-dimensional genome assembly by utilizing HiC interaction under the 3-dimensional condition); 2. HiFi backbone route: hiC. The number of chromosomes and the number of genomic gaps (unassembled sequences, represented as sequence blanks filled with 100N at the genomic chromosome level) were obtained by this step. The specific steps of HiC mount are shown in fig. 2, and are common routes in the art.
6. The step is applied to ONT skeleton route
Third generation error correction: intermediate results generated in a mitochondria-removed chloroplast low-reads support contig flow are firstly compared with third-generation filteroreads after filtration according to the length of 40Kbp, and the genome is subjected to mounting after hic. The number of kmer occurrences of the second generation data and the number of kmer occurrences in the genome were counted by means of the mer (https:// gitsub. Com/marbl/mer), and 21-kmer occurring once in the assembly and 14 to 46 times in the second generation data were found and considered to be the centromeric region to be marked. The bam file is marked by using a subprogram filter_by_marker_non.sh under T2 T_pole (https:// github.com/arangshie/T2T-Polish), and error correction is performed by using the medaka coherent (https:// github.com/nano-reich/medaka), and the medaka stitches to obtain a consistency sequence so as to achieve the aim of avoiding excessive correction of the centromere.
Second generation error correction: the second generation raw data was sliced, aligned to the genome separately using bwa (https:// gitsub. Com/lh 3/bwa), then sort to get the total sorted. Merge. Bam, and an index file was created. Performing mutation detection by using deepvariant (https:// github.com/google/deepvariant) to obtain a vcf file, and deriving a consistency sequence by bcftoolsconsausus (https:// github.com/samtools/bcftools) to obtain a genome after second-generation error correction.
7. Telomere extension for two routes
Comparing each chromosome with the original data by using a window map (https:// gitub. Com/marbl/Winnowmap), collecting all reads which are compared once within 50bp of the tail end of the chromosome, calculating the times of occurrence of telomere repetitive motifs ('CCCTAA'/'TTTAGGG') in each read, defining the read with the largest occurrence as ref, and the other is query, and reassembling the ref and the query by using the medaka_conconsu to obtain a consistency sequence. And (3) respectively comparing the identical sequences to each chromosome, and replacing the terminal telomere sequences by taking the optimal comparison result. If the identity is below the 80 threshold or the aligned upper region is not 20Kbp at the end of the chromosome, no substitution is made.
8. Supplement gap
The gap_closer flow is used for filling gaps on genome, the window map is used for comparing the hole filling data (without N) with the genome data (with N), the flow is divided into three levels for filling the gaps, and the priorities are as follows: other error corrected genome version > HiFi data > ONT data. The other version of the corrected genome can be nextdenovo, necat, hifiasm, canu … or any other version of genome; the HiFi data is preferably ccs.fasta, namely the HiFi data after cyclizing; the ONT data is preferably cns.fa, i.e. ONT data is subjected to the next denov self-error correction to generate a consistency sequence coherent. Under the condition of insufficient data volume, the selection of the mutants/ont original reads can be considered to fill in holes on the genome. If the position on the alignment can just cross the two ends of the gap, the optimal alignment region of the longest length region on the alignment is selected, and the sequence containing the gap region on the genome is replaced by the complementary gap data. Theoretically, the procedure can supplement the gap as long as the amount of HiFi/ONT data provided for hole supplement is large enough. However, the process is also limited, and is not suitable for a very short area with multiple gaps, so that the length of the gap extending to two ends is not recommended to exceed the shortest distance between two gaps.
HIFI error correction
In the step, the gap filling operation is performed, and in order to ensure that the sequences filled by us are reliable and the consistency of genome quality is ensured, the genome after hole filling is subjected to error correction by using HiFi data. Filtering out the original data with HiFi data below 10Kbp, and comparing the filtered data with the genome of the hole by using a window map to obtain a target. The bam file was filtered with samtools view-F256 (https:// github. Com/samtools/samtools), and the alignment fragments were filtered with falconcba-filter-clip (https:// github. Com/PacificB biosciences/FALCON), and finally error corrected with the lipover branch of racon (https:// github. Com/isovic/racon) to obtain the HiFi corrected genomic sequence.
10. Chromosome ordering and co-linearity
Mapping the assembly result onto the genome of the published species, corresponding the chromosome number and the chromosome direction according to the chromosome number and the chromosome direction of the published genome, and then co-linearizing the adjusted genome with the genome of the published species. Likewise, the genome and ref after position adjustment can also be input into a collinearity flow to be made a mu mmer collinearity diagram or jcvi collinearity diagram, so that a straight line or waterfall type collinearity diagram can be obtained, and the collinearity situation between the two versions can be better seen.
11. Genome final confirmation
And selecting a genome with high integrity and high accuracy from the two lines as a final genome.
Examples
1. Genome assembly was performed using Pacific Bio Inc. (PacBio) high fidelity sequencing data (HiFi) to generate contigs (contigs). Approximately 20Gb Pacific Biotechnology information center (NCBI) rice Pacific Biotechnology Co (PacBIO) high-fidelity sequencing data (HiFi) was used. Assembly was performed using a hifiasm (https:// github. Com/chhyp 123/hifiasm). The assembly results are shown in table 1.
Table 1: initial assembly result table of hifiasm
num_seqs sum_len min_len avg_len max_len N50
141 401950498 9114 2850712.8 45002514 31388662
2. Flye (https:// gilsub.com/senders glass/Flye), nextdenovo (https:// gilsub.com/NextDenovo), necat (https:// gilsub.com/xiaochuale/NECAT) were assembled using three types of genome assembly software, respectively, using about 50Gb of rice ONT ultralong data downloaded from the National Center for Biotechnology Information (NCBI). The assembly results are shown in table 2.
Table 2: ONT data preliminary assembly result table
num_seqs sum_len min_len avg_len max_len N50
Nextdenovo 23 398481786 110804 17325295.0 44949911 31426128
flye 77 395339674 1472 5134281.5 52039294 24103585
necat 61 401818451 23811 6587187.7 44884577 24129713
3. For highly heterozygous species, the genome needs to be dehisced using the software Purgejhaplotigs (v1.0.4; https:// gitub. Com/skingan/purgejhaplotigs_multibam)/Purgejdups (v1.2.5; https:// gitub. Com/df guide/purgejdups).
4. Alignment of mitochondria, chloroplasts using the minimap2 (2.17-r 941) (h.li 2018) software, removal of sequences with base alignment exceeding 50%; bacterial contamination was removed by blast refseq library; removal of low-reads supported contigs (Minimap 2 compares ONT reads (> = 40 kbp) to contigs, if more than 50% of the sites on a contig were below 15 a deep, the threshold was adjusted appropriately according to the data volume.
5. Auxiliary assembly using interaction relationship:
1) Contig clustering
And determining the degree of tightness of association between different Contigs in the effective data by using the Hi-C interaction relationship, and clustering the Contigs. For a genome sketch with a karyotype of 2n, the contact sequences of the sketch are clustered into n chromosome groups by means of an aglaic (v0.9.8) (Zhang et al.2019) software by means of an aglaignohierachicalcalalcfusion (bottom-up hierarchical clustering algorithm).
2) Contig sequencing and orientation
The method comprises the steps of sequencing and orienting the contacts inside n chromosome groups by ALLHIC (v0.9.8) (Zhangetal.2019) software, and converting the interaction relationship between the contacts into a specified binary file (namely, hic file) by software 3D-DNA (v 180419) (Dudchenkoal.2017) and juicer (v 1.6) (Nevac.al.2016b). The sequenced and oriented content (assembly) is then manually sequenced and oriented by windows software Juicebox (v1.11.08) (nevac.et al.2016a).
3) Contig redundancy elimination using interaction relationship
If a sequence does not interact at all with a sequence of the same size, but interacts normally with other sequences, then it is highly likely that the sequence will be a hybrid sequence and will need to be removed manually.
The sequenced, oriented, de-redundant Contig sequences were used to obtain the final chromosome-level genomic sequence using 100N-up gaps. Details of the chromosome and its length statistics are shown in Table 3, and the non-interacting or interacting noise-containing Contig is an unhooked fragment (denoted as chrUnn).
4) Chromosome mounting result statistics: after Hi-C assembly and manual adjustment, genomic sequences sharing the sequence length of 397,785,348 bp were mapped to 12 chromosomes at a 99.25% ratio. The detailed statistics of each chromosome and its length are shown in Table 3.
Table 3: chromosome length statistics
Chromosome Length(bp) contignumber
chr1 25,815,008 6
chr2 31,882,477 1
chr3 38,842,104 2
chr4 32,745,327 13
chr5 32,075,978 5
chr6 24,932,023 1
chr7 37,439,256 3
chr8 30,588,819 2
chr9 27,101,476 2
chr10 31,388,662 1
chr11 45,002,514 1
chr12 39,971,698 1
chrUnn 3,021,772 73
5) The software HiCExPLorer (v 3.6) (Joachim et al 2020) was used to plot based on the relationship between contig interaction strength and position, as shown in FIG. 3.
6. Assembly error correction
6.1 third generation error correction:
three generations of error correction were performed using the k-mer anchoring method for the centromere region (note that this error correction approach was only for T2T genome ont scaffold);
1) ONT reads (> = 40 kbp) were aligned to pseudochromosomes using minimap2 (v 2.17-r 941) (H.Li 2018).
2) To avoid excessive correction of centromere sequences, 21-kmer was found to occur once in assembly and 14 to 46 times in Illumina reads, and T2T-Polish was used
The subroutine filter_by_marker_nosplit (https:// github. Com/map/T2T-Polish) marks the aligned bam.
3) The use of medakaconsensus (v1.5.0; parameters: -model r941_prom_ hac _g507
-batch_size200; https:// gitsub.
4) Using a medakstitch (v1.5.0; https:// gitsub.com/nanoporetech/medaka) yields a consistency sequence after three generations of error correction.
Repeating the steps, namely, the T2T genome ONT skeleton edition needs to iterate two rounds of three-generation error correction.
6.2 second generation error correction:
second generation error correction was performed using deepVariant (v1.3.0) (poprinet al.2018) (note that this error correction approach is only for the T2T genome ONT backbone version);
1) Using bwamem (v0.7.17-r 1188) (HengLi 2013), samtools (v 1.9)
(Daneceketal.2021) to obtain a bam file with aligned third-generation error corrected genome and second-generation data.
2) Using deepvariant (v1.3.0; parameters: model_type=wgs) (poprinet al 20
18 A vcf file is obtained.
3) Bcftools (v 1.15; parameters: view-e ' type= "ref '", view-i ' QUAL >1
The vcf file is filtered by & (gt= "AA" ||gt= "AA")') (daneceketal.2021), and left-aligned and normalized with bcftoolsnorm (daneceketal.2021).
4) Compressing vcf files and generating index files.
5) bcftoolsconsensus (daneceketal.2021) derives a consistency sequence after second generation error correction.
7. Telomere extension
1) Winnowmap (v 1.11, parameters: k=15, -MD) (chiragetal.2020) all ontruss are aligned to ref and all r-russ aligned once within 50bp of the chromosome end are collected.
2) The number of occurrences of the telomere repeat motif ('CCCTAAA'/'ttaggg') in each read was calculated (telomere database: http:// telomerase. Asu. Edu/sequences_telomere. Html), and defines the read that occurs the most times as ref, the others as query.
3) The medaka_consingsu (v1.2.1, parameter: -mr941_min_high_g360; https:// gitsub.
com/nanoporetech/medaka), and reassembling ref telomere read and query telomere read to obtain a consensus sequence.
4) The sequence identical to the telomere was aligned to each chromosome using nucmer (v 3.1) (Kurtzetal.2004), and the terminal telomere sequence was replaced with the best alignment. If the identity is below the 80 threshold or the aligned upper region is not 20Kbp at the end of the chromosome, no substitution is made.
Gap filling
For filling up gaps in the genome, with a winnowmap (v 1.11, parameters: k=15, -MD)
(chiragetal 2020) compare hole filling data (without N) to genomic gap intervals (with N), this step is divided into three levels to fill gaps, with priority: other corrected genome version > hifi data > ont data. The other version of the corrected genome can be nextdenovo, necat, hifiasm, canu … or any other version of genome; the hifi data is preferably ccs, fasta, namely the hifi data after cyclizing; ont the data is preferably cns.fa, i.e. ont the data is subjected to the next denov self-error correction to produce the consensus sequence coherent. In the case of insufficient amounts of data, the selection of mutants/ontruses may be considered to fill in the genome. If the position on the alignment can just cross the two ends of the gap, the optimal alignment region of the longest length region on the alignment is selected, and the sequence containing the gap region on the genome is replaced by the complementary gap data. Theoretically, a 0gap genome could be obtained as long as the amount of hifi/ont data provided for hole filling is large enough and other assembled versions of the genome for hole filling are sufficiently large.
HiFi reads error correction
1) Using Winnowmap2 (Chirag et al 2020), the > = 10kbp HiFi reads were compared to the post-padding gap version genome (parameters: k=15 groter-than distict=0.9998
–MD-ax map-pb);
2) Samtools "view" (v 1.10, parameters: f256) (Danecek et al 2021) to filter the fragments;
3) Deletion of the alignment fragment (-t-F0 x 104) using "falconc bam-filter-clip";
4) Using these filtered alignment information, error correction is performed using a special branch of racon, (v 1.6.0,
-L-u,https://github.com/isovic/racon/tree/liftover)。
T2T genome evaluation
1) The continuity of genome assembly was assessed by the position and number of gaps on the genome, as shown in table 4.
Table 4:0gap genome
2) Using software bwa (version: 0.7.17-r 1188) and comparing (mapping) the quality control reads (clean reads) obtained by the second generation high throughput sequencing (such as the Illumina sequencing platform) to the reference genome. The statistical second-generation alignment and coverage are shown in table 5.
Table 5:
note that: sample name: sample name; mapping rate (%): the comparison rate; paired mapping rate (%): double-end comparison rate; average sequencing depth: average sequencing depth; coverage (%): coverage; coverage at least 4X (%): coverage above 4X depth; coverage at least 10X (%): coverage above 10X depth; coverage at least 20X (%): coverage above 20X depth.
3) Software BUSCO (version: 4.1.4; parameters: -evaluation 1 e-05) to evaluate the integrity of the genome assembly, the results are shown in table 6. At better genome integrity, the Complete BUSCOs (C) ratio in the B USCO evaluation in the T2T genome is typically greater than 95%.
Table 6: integrity assessment results of genome assembly
Note that: complete BUSCOs are Complete BUSCO; complete and single-copy BUSCOs are complete and single copy BUSCOs; complete and duplicated BUSCOs is complete but multicopy BUSCO; fragmented BUSCOs is incomplete BUSCO; the Missing BUSCO is Missing BUSCO; total BUSCO groups searched is the total number of conserved protein genes recorded in BUSCO pool used this time.
4) The results of the kmer statistics through the second generation data and the genome are shown in table 7. The number of the kmer species and the distribution thereof were compared to judge the quality and accuracy of genome assembly.
Table 7: genome assembly QV value statistics
qv completeness(%) length N50 ReadsNum
45.1739 99.1838 397,733,901 32,070,994 12
Note that: qv is the qv value used to assess genome quality; complexation is used to assess genome integrity; length is the genome length; n50 is N50 of contig grade, namely all contigs are sequentially accumulated after being sequenced from long to short, and when the added length reaches half of the total length of the genome, the last contig length is N50; reads Num is the number of contig-level genome Reads.
5) The results of the telomere sequence identification were based on the telomere region base repeat sequence (CCCTAAA at the 5 'end and ttaggg at the 3' end) and are shown in table 8.
Table 8: telomere repeat statistics
/>
6) Centromere sequence identification
(I)ChIP-Seq
After obtaining centromere ChIP-seq data, screening by comparison with genome sequence to obtain data with single copy on genome, and anchoring the data to genome of corresponding species to obtain ChIP-seq enriched peak diagram indicating centromere region. This method has been used for precise localization of the genome of centromeres.
(II) FISH verification
Fluorescent In Situ Hybridization (FISH) verification combined with cytology eventually allows identification of centromeric region specific repeat sequences.
11. Final T2T version Hi-C interaction heat map
After telomere extension, gap supplementation and error correction, a new chromosome-level Hi-C interaction heat map is obtained through Hi-C data again by using the final T2T version genome, and whether the previous operation is correct or not is identified through the abnormality of interaction signals in the heat map as shown in figure 4.
12. Co-linearity analysis
Comparing the assembled genome with historical genome version of the target species; the analysis was limited to historical genomic versions of species and the results of the colinear analysis are shown in figure 5. The historical genome version is a genome obtained by another T2T assembly method.
The present invention is not limited to the details and embodiments described herein, and further advantages and modifications may readily be achieved by those skilled in the art, so that the present invention is not limited to the specific details, representative solutions and examples described herein, without departing from the spirit and scope of the general concepts defined by the claims and the equivalents.

Claims (9)

1. A high-throughput sequencing-based animal and plant genome T2T assembly method, which is characterized by comprising the following steps:
s1, based on sample sequencing, respectively performing ONT ultra-long sequencing data assembly and HiFi sequencing data assembly;
s2, removing redundancy of assembly results of the two routes based on a genome size evaluation result;
s3, respectively carrying out HiC mounting on the assembly results of the two routes to obtain a chromosome level reference genome gap;
s4, performing third-generation error correction on the ONT line genome centromere region, and performing second-generation error correction on the ONT line genome based on second-generation data;
s5, comparing terminal fragments based on original data, reassembling according to the abundance of the telomere repeated base sequence, replacing terminal telomere sequences through comparison, and extending telomeres of two lines;
s6, respectively supplementing gap and HiFi error correction to the genome sequence;
s7, adjusting chromosomes by combining genome sequencing of known species to obtain assembled genomes;
the telomere extension process in the step S5 is as follows:
s51, comparing each chromosome with original data respectively, collecting all fragments which are compared once in a fixed length at the tail end of the chromosome, calculating the times of occurrence of telomere repeated base sequences in each fragment, defining the fragment with the largest occurrence times as ref, and reassembling ref and the query to obtain a consistency sequence;
s52, comparing the identical sequences to each chromosome respectively, and replacing the terminal telomere sequences by taking the optimal comparison result.
2. The method for assembling animal and plant genome T2T according to claim 1, wherein the third generation of error correction in step S4 is: iterative rounds of three-generation error correction were performed on the ONT genomic centromere region using the k-mer anchoring method.
3. The method of claim 2, wherein the ONT genome centromere region is based on second generation data and the frequency of k-mers in the genome, marking 21-kmers that occur once in assembly and 14 to 46 times in second generation data.
4. The method for assembling animal and plant genome T2T according to claim 1, wherein the second generation error correction process in step S4 is: and cutting and comparing the second-generation original data to the genome, performing mutation detection through a deep neural network to obtain mutation information, and filtering and aligning the mutation information to obtain a consistency sequence so as to obtain the genome after second-generation error correction.
5. The method according to claim 1, wherein in the step S52, if the identity is lower than 80 threshold or the alignment upper region is not 20Kbp at the chromosome end, no substitution is performed.
6. The method of claim 1, wherein the genome sequence gap filling step in step S6 is: hole filling is carried out on the genome based on a hole filling program, hole filling data are compared with genome data, and hole filling is respectively carried out according to the sequence of other corrected genome versions, hiFi data and ONT data.
7. The method according to claim 6, wherein in the gap filling step, the genome is filled with the original segment of the mutation/ont in the case of insufficient data; if the position on the alignment can just cross the two ends of the gap, the optimal alignment region of the longest length region on the alignment is selected, and the sequence containing the gap region on the genome is replaced by the complementary gap data.
8. The method of claim 1, wherein the genome HiFi correction step in step S6 is: firstly filtering HiFi data lower than 10kbp, and comparing the filtered data with the genome of the hole to obtain a combined and ordered file; and respectively filtering out the fragments subjected to secondary alignment and chimeric alignment, and performing third-generation error correction.
9. The method for assembling animal and plant genome T2T according to claim 1, further comprising the step of co-linearity analysis: performing colinear alignment analysis on the adjusted genome in the step S7 and the genome of the published species, wherein analysis indexes comprise integrity and accuracy.
CN202211549592.3A 2022-12-05 2022-12-05 T2T assembly method based on high-throughput sequencing animal and plant genome Active CN115810395B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211549592.3A CN115810395B (en) 2022-12-05 2022-12-05 T2T assembly method based on high-throughput sequencing animal and plant genome

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211549592.3A CN115810395B (en) 2022-12-05 2022-12-05 T2T assembly method based on high-throughput sequencing animal and plant genome

Publications (2)

Publication Number Publication Date
CN115810395A CN115810395A (en) 2023-03-17
CN115810395B true CN115810395B (en) 2023-09-26

Family

ID=85484938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211549592.3A Active CN115810395B (en) 2022-12-05 2022-12-05 T2T assembly method based on high-throughput sequencing animal and plant genome

Country Status (1)

Country Link
CN (1) CN115810395B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116343919B (en) * 2023-04-11 2023-12-08 天津大学四川创新研究院 Whole genome map drawing and sequencing method
CN117672354B (en) * 2023-12-21 2024-05-28 北京诺禾致源科技股份有限公司 Method and apparatus for comparing quality of complete genome assembly of closely related species of mammals

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105121661A (en) * 2013-02-01 2015-12-02 加利福尼亚大学董事会 Methods for genome assembly and haplotype phasing
CN107250356A (en) * 2014-12-16 2017-10-13 加文医学研究所 Sequencing control
CN109326323A (en) * 2018-09-13 2019-02-12 北京百迈客生物科技有限公司 A kind of assemble method and device of genome
CN111584004A (en) * 2020-05-12 2020-08-25 西藏自治区农牧科学院水产科学研究所 Tibet characteristic fish genome assembly method based on three-dimensional omics data
CN112996927A (en) * 2018-10-31 2021-06-18 罗格斯新泽西州立大学 GRAMC: method for determining genome-scale reporter of cis-regulatory module
CN113151426A (en) * 2021-04-16 2021-07-23 中国农业科学院兰州畜牧与兽药研究所 Method for assembling and annotating Hobara sheep genome based on three-generation PacBio and Hi-C technology
CN113767438A (en) * 2019-02-28 2021-12-07 加利福尼亚太平洋生物科学股份有限公司 Improved alignment using homopolymer fold sequencing reads
CN113808668A (en) * 2021-11-18 2021-12-17 北京诺禾致源科技股份有限公司 Method and device for improving genome assembly integrity and application thereof
CN115148290A (en) * 2022-08-12 2022-10-04 武汉希望组生物科技有限公司 Hole filling method based on third-generation sequencing data

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105121661A (en) * 2013-02-01 2015-12-02 加利福尼亚大学董事会 Methods for genome assembly and haplotype phasing
CN107250356A (en) * 2014-12-16 2017-10-13 加文医学研究所 Sequencing control
CN109326323A (en) * 2018-09-13 2019-02-12 北京百迈客生物科技有限公司 A kind of assemble method and device of genome
CN112996927A (en) * 2018-10-31 2021-06-18 罗格斯新泽西州立大学 GRAMC: method for determining genome-scale reporter of cis-regulatory module
CN113767438A (en) * 2019-02-28 2021-12-07 加利福尼亚太平洋生物科学股份有限公司 Improved alignment using homopolymer fold sequencing reads
CN111584004A (en) * 2020-05-12 2020-08-25 西藏自治区农牧科学院水产科学研究所 Tibet characteristic fish genome assembly method based on three-dimensional omics data
CN113151426A (en) * 2021-04-16 2021-07-23 中国农业科学院兰州畜牧与兽药研究所 Method for assembling and annotating Hobara sheep genome based on three-generation PacBio and Hi-C technology
CN113808668A (en) * 2021-11-18 2021-12-17 北京诺禾致源科技股份有限公司 Method and device for improving genome assembly integrity and application thereof
CN115148290A (en) * 2022-08-12 2022-10-04 武汉希望组生物科技有限公司 Hole filling method based on third-generation sequencing data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
单倍体基因组序列组装方法研究;官登峰;《全国优秀博硕士学位论文全文库(博士) 基础科学辑》;全文 *
芥菜型油菜A9染色体物理图谱构建及重要基因研究;陆赢;《全国优秀博硕士学位论文全文库(博士) 农业科技辑》;全文 *

Also Published As

Publication number Publication date
CN115810395A (en) 2023-03-17

Similar Documents

Publication Publication Date Title
CN115810395B (en) T2T assembly method based on high-throughput sequencing animal and plant genome
Qiu et al. The genome of Populus alba x Populus tremula var. glandulosa clone 84K
Li et al. Genome sequence of cultivated Upland cotton (Gossypium hirsutum TM-1) provides insights into genome evolution
Ropars et al. Evidence for the sexual origin of heterokaryosis in arbuscular mycorrhizal fungi
JP5960917B1 (en) Rice whole genome breeding chip and its application
Shi et al. The complete reference genome for grapevine (Vitis vinifera L.) genetics and breeding
Zheng et al. QTL mapping combined with bulked segregant analysis identify SNP markers linked to leaf shape traits in Pisum sativum using SLAF sequencing
CN106939342B (en) SNP marker linked with millet beige, primer and application
CN116334248A (en) Liquid chip for local chicken genetic resource protection and variety identification and application thereof
CN108642201B (en) SNP (Single nucleotide polymorphism) marker related to millet plant height character as well as detection primer and application thereof
Wang et al. Variation burst during dedifferentiation and increased CHH-type DNA methylation after 30 years of in vitro culture of sweet orange
KR101539737B1 (en) Methodology for improving efficiency of marker-assisted backcrossing using genome sequence and molecular marker
CN111088389B (en) SSR molecular marker closely linked to corn leaf width as well as amplification primer and application thereof
CN117965781A (en) Peanut 40K liquid-phase SNP chip 'PeanutGBTS K' and application thereof
Ercolano et al. Complex migration history is revealed by genetic diversity of tomato samples collected in Italy during the eighteenth and nineteenth centuries
CN109055593B (en) SNP (Single nucleotide polymorphism) marker for improving cotton lint and high-yield cotton identification and breeding method
KR20200055666A (en) DNA marker for discriminating genotype of Chinese cabbage, radish and their intergeneric hybrid and uses thereof
CN108707685B (en) SNP (Single nucleotide polymorphism) marker related to tillering number character of millet as well as detection primer and application thereof
CN114566214B (en) Method for detecting genome deletion insertion variation, detection device, computer readable storage medium and application
CN108642199B (en) SNP (Single nucleotide polymorphism) marker related to growth of millet flag leaves as well as detection primer and application thereof
CN108715901B (en) SNP marker related to millet plant height character and detection primer and application thereof
CN113718342A (en) Construction method of high-density genetic map of recombinant inbred line population
CN108642203B (en) SNP (Single nucleotide polymorphism) marker related to millet stem thickness character as well as detection primer and application thereof
CN108707684B (en) SNP (Single nucleotide polymorphism) marker related to millet flag leaf length and detection primer and application thereof
KR101911307B1 (en) Method for selecting and utilizing tag-SNP for discriminating haplotype in gene unit

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant