CN115810395B

CN115810395B - T2T assembly method based on high-throughput sequencing animal and plant genome

Info

Publication number: CN115810395B
Application number: CN202211549592.3A
Authority: CN
Inventors: 舒自强; 蒋冕; 樊鹏宇; 杨路路; 陈刚; 苏晓俊; 田朝阳; 宋驰
Original assignee: Wuhan Bena Technology Co ltd
Current assignee: Wuhan Bena Technology Co ltd
Priority date: 2022-12-05
Filing date: 2022-12-05
Publication date: 2023-09-26
Anticipated expiration: 2042-12-05
Also published as: CN115810395A

Abstract

The invention discloses a T2T assembly method based on high-throughput sequencing animal and plant genome, which comprises the following steps: s1, based on sample sequencing, respectively performing ONT ultra-long sequencing data assembly and HiFi sequencing data assembly; s2, removing redundancy of assembly results of the two routes based on a genome size evaluation result; s3, respectively carrying out HiC mounting on the assembly results of the two routes to obtain a chromosome level reference genome gap; s4, performing third-generation error correction on the ONT line genome centromere region, and performing second-generation error correction on the ONT line genome based on second-generation data; s5, comparing terminal fragments based on original data, reassembling according to the abundance of the telomere repeated base sequence, replacing terminal telomere sequences through comparison, and extending telomeres of two lines; s6, respectively supplementing gap and HiFi error correction to the genome sequence; s7, adjusting chromosomes by combining genome sequencing of known species to obtain assembled genomes; the T2T genome telomeres are complete and have no gaps, so that the research progress of animals and plants is promoted.

Description

T2T assembly method based on high-throughput sequencing animal and plant genome

Technical Field

The invention relates to the technical field of gene assembly, in particular to a T2T assembly method based on high-throughput sequencing of animal and plant genome.

Background

High-throughput sequencing technology (High-throughput sequencing) is also known as Next-generation sequencing technology (Next-generation sequencing technology). High-throughput sequencing technology has now come to the fourth generation, nanopore sequencing technology (also known as fourth generation sequencing technology) is a new generation of sequencing technology that has emerged in recent years. The current sequencing length can reach 4Mb. This technology began in the 90 s and underwent three major technological innovations: 1. single molecule DNA passes through the nanopore; 2. the enzyme on the nanopore controls the single nucleotide precision of the sequencing molecule; 3. sequencing accuracy control of single nucleotides. With the continuous development of species genome research, sequencing technology is continuously improved, and the continuity and the integrity of the species genome are also greatly improved. Species genomes range from draft genome (genome 1.0) assembled by common second generation sequencing, to near complete genome (genome 3.0) assembled by ONT (N50 >50Kb or R10) sequencing techniques, to T2tgen genome (genome 4) assembled by binding to ontura-long (oxford Kong Chaochang sequencing techniques) and pacb High definition (High accuracy read length sequencing techniques) sequencing techniques. Animal and plant T2T genome refers to a horizontal genome obtained by combining HiFi and second generation data (high throughput sequencing, such as Illumina HiSeqTM/MiseqTM) through ONTulter-longN 50>100Kb (sequencing read length N50 is more than 1000000, N50: N50 is an evaluation index after genome splicing, all sequences obtained by splicing are sequenced from big to small according to sequence sizes, then accumulation is gradually started, when the added length exceeds half of the total length, the added sequence length is N50 length), one or more obtained chromosomes reach Telomere-to-Telomere (T2T) genome completion map is the ultimate goal of genome assembly.

The requirements for T2T genome assembly include: chromosome assembly is free of gaps, the QV value (genome assembly accuracy evaluation standard) is greater than 40, BUSCO (genome assembly integrity evaluation standard) evaluation is greater than 95%, and telomeres are complete. It has been difficult to achieve T2T levels in conventional single sequencing means, ONT common sequencing or PacBIOHiFi. The common ONT sequencing assembly precision is low, the quality standard with the QV more than 40 is difficult to reach, and the chromosome assembly is difficult to reach without gaps; while high assembly accuracy can be achieved by HiFi sequencing assembly, chromosome assembly is not free from gaps, and centromere regions often have assembly gaps.

Therefore, it is necessary to address the drawbacks of the prior art and propose a method for assembling T2T based on high throughput sequencing of animal and plant genomes.

Disclosure of Invention

The invention aims to provide a T2T assembly method based on high-throughput sequencing of animal and plant genome, which can overcome the defects in the past by combining an ONT ultra-long sequencing technology with a HiFi sequencing technology and a high-standard and high-quality assembly technology, and the genome assembly reaches the T2T level, so that the research progress of the whole species is promoted.

In view of this, the scheme of the invention is as follows:

an animal and plant genome T2T assembly method based on high-throughput sequencing comprises the following steps:

s1, based on sample sequencing, respectively performing ONT ultra-long sequencing data assembly and HiFi sequencing data assembly;

s2, removing redundancy of assembly results of the two routes based on a genome size evaluation result;

s3, respectively carrying out HiC mounting on the assembly results of the two routes to obtain a chromosome level reference genome gap;

s4, performing third-generation error correction on the ONT line genome centromere region, and performing second-generation error correction on the ONT line genome based on second-generation data;

s5, comparing terminal fragments based on original data, reassembling according to the abundance of the telomere repeated base sequence, replacing terminal telomere sequences through comparison, and extending telomeres of two lines;

s6, respectively supplementing gap and HiFi error correction to the genome sequence;

s7, adjusting the chromosomes by combining the genome sequence of the known species to obtain an assembled genome.

In one embodiment of the present invention, the third generation error correction procedure in the step S4 is: iterative rounds of three-generation error correction were performed on the ONT genomic centromere region using the k-mer anchoring method.

As a preferred embodiment, the ONT genomic centromere region is based on the second generation data and the frequency of k-mers in the genome, the markers are 21-kmers that occur once in the assembly and 14 to 46 times in the second generation data.

In one embodiment of the present invention, the second generation error correction process in the step S4 is: and cutting and comparing the second-generation original data to the genome, performing mutation detection through a deep neural network to obtain mutation information, and filtering and aligning the mutation information to obtain a consistency sequence so as to obtain the genome after second-generation error correction.

In one embodiment of the present invention, the step S5 telomere extension process is:

s51, comparing each chromosome with original data respectively, collecting all fragments which are compared once in a fixed length at the tail end of the chromosome, calculating the times of occurrence of telomere repeated base sequences in each fragment, defining the fragment with the largest occurrence times as ref, and reassembling ref and the query to obtain a consistency sequence;

s52, comparing the identical sequences to each chromosome respectively, and replacing the terminal telomere sequences by taking the optimal comparison result.

As a preferred embodiment, in the step S52, if the identity is lower than 80 threshold or the upper region of the alignment is not 20Kbp at the chromosome end, no substitution is performed.

In one embodiment of the present invention, the genome sequence complement step in step S6 is as follows: hole filling is carried out on the genome based on a hole filling program, hole filling data are compared with genome data, and hole filling is respectively carried out according to the sequence of other corrected genome versions, hiFi data and ONT data.

As a preferred embodiment, in the gap filling step, under the condition that the data volume is insufficient, the genome is filled with the original segment of the mutants/ont; if the position on the alignment can just cross the two ends of the gap, the optimal alignment region of the longest length region on the alignment is selected, and the sequence containing the gap region on the genome is replaced by the complementary gap data.

In one embodiment of the present invention, the genome HiFi error correction step in step S6 is: firstly filtering HiFi data lower than 10kbp, and comparing the filtered data with the genome of the hole to obtain a combined and ordered file; and respectively filtering out the fragments subjected to secondary alignment and chimeric alignment, and performing third-generation error correction.

In one embodiment of the invention, the T2T assembly method further comprises a collinearity analysis step of: performing colinear alignment analysis on the adjusted genome in the step S7 and the genome of the published species, wherein analysis indexes comprise integrity and accuracy.

Compared with the prior art, the beneficial effects of the invention include, but are not limited to:

1. the T2T assembly method based on the high-throughput sequencing animal and plant genome can overcome the defects in the past by combining an ONT ultra-long sequencing technology with a HiFi sequencing technology and a high-standard and high-quality assembly technology, the genome assembly reaches the T2T level, and the genome with complete telomeres and no gaps is obtained, so that the research progress of the whole species is promoted.

2. The high-throughput sequencing-based animal and plant genome T2T assembly method provided by the invention is based on a plurality of sequencing means and is assisted by complex open source software and self-research software to achieve the effect of simple assembly beyond a single sequencing means, thereby providing a new choice for genome T2T assembly.

Drawings

FIG. 1 is a flow chart of the method for assembling T2T genome of animals and plants according to the invention.

Fig. 2 is a flowchart showing steps specific to HiC of the present invention.

FIG. 3 is an interaction heat map based on the relationship between the interaction intensity and the position in the embodiment of the invention.

FIG. 4 is a heat map of Hi-C interaction of the final T2T version in an embodiment of the invention.

FIG. 5 is a schematic diagram showing the results of genome colinear analysis obtained by the T2T assembly method in the examples of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantageous technical effects of the present invention more apparent, the present invention will be described in further detail with reference to the following detailed description. It should be understood that the detailed description is intended to illustrate the invention, and not to limit the invention.

In one embodiment, a method for assembling animal and plant genome T2T based on high-throughput sequencing is provided, and the flow is shown in FIG. 1, and the steps are as follows:

1. sequencing: the same animal and plant biological samples need to be subjected to second generation sequencing (illuminea) or BGI (Huada gene), ONT ultra-long sequencing (OxfordNOPOPLOcore technology platform instrument sequencing), hiC sequencing (Hi-C technology is derived from genome capture technology), pacBioHiFi sequencing (PacificBisciences).

2. Evaluation: genome size was estimated by second generation sequencing into a surviviny genome analysis.

3. And (3) primary assembly: in two lines.

3.1, taking an ONT ultra-long sequencing data assembly result as a framework (hereinafter referred to as ONT framework), and respectively using Flye, nextdenovo and Necat three genome assembly software for assembly;

Flye：https://github.com/fenderglass/Flye；

Nextdenovo：https://github.com/Nextomics/NextDenovo；

Necat：https://github.com/xiaochuanle/NECAT。

3.2 assembling with HIFIasm (https:// gitsub.com/chhyp 123/hifiasm) using the HIFIasm sequencing data as a scaffold (hereinafter referred to as HIFI scaffold).

4. Redundancy elimination and evaluation

The assembled genome size exceeds the surviviny estimated size, and in order to reduce the genome to the estimated size, the assembly results of the two routes are required to be subjected to redundancy elimination; QV value calculation (post-assembly genome accuracy assessment value) was performed using merqury (https:// gilsub.com/searchq = merqury); post-assembly genome integrity assessment was performed using BUSCO (https:// BUSCO. Ezlab. Org /); back-aligned to the assembled genome using reads; removing low mass contigs (assembled sequences); the subject species mitochondria, chloroplasts were downloaded at NCBI and the assembled genome was aligned using minimap2 (https:// gitsub. Com/lh3/minimap 2), and contigs on the alignment were removed.

HiC mount

5.1ONT backbone route: selecting the best data in the three assembly results for HiC mounting (mounting is an application for guiding two-dimensional genome assembly by utilizing HiC interaction under the 3-dimensional condition); 2. HiFi backbone route: hiC. The number of chromosomes and the number of genomic gaps (unassembled sequences, represented as sequence blanks filled with 100N at the genomic chromosome level) were obtained by this step. The specific steps of HiC mount are shown in fig. 2, and are common routes in the art.

6. The step is applied to ONT skeleton route

Third generation error correction: intermediate results generated in a mitochondria-removed chloroplast low-reads support contig flow are firstly compared with third-generation filteroreads after filtration according to the length of 40Kbp, and the genome is subjected to mounting after hic. The number of kmer occurrences of the second generation data and the number of kmer occurrences in the genome were counted by means of the mer (https:// gitsub. Com/marbl/mer), and 21-kmer occurring once in the assembly and 14 to 46 times in the second generation data were found and considered to be the centromeric region to be marked. The bam file is marked by using a subprogram filter_by_marker_non.sh under T2 T_pole (https:// github.com/arangshie/T2T-Polish), and error correction is performed by using the medaka coherent (https:// github.com/nano-reich/medaka), and the medaka stitches to obtain a consistency sequence so as to achieve the aim of avoiding excessive correction of the centromere.

Second generation error correction: the second generation raw data was sliced, aligned to the genome separately using bwa (https:// gitsub. Com/lh 3/bwa), then sort to get the total sorted. Merge. Bam, and an index file was created. Performing mutation detection by using deepvariant (https:// github.com/google/deepvariant) to obtain a vcf file, and deriving a consistency sequence by bcftoolsconsausus (https:// github.com/samtools/bcftools) to obtain a genome after second-generation error correction.

7. Telomere extension for two routes

Comparing each chromosome with the original data by using a window map (https:// gitub. Com/marbl/Winnowmap), collecting all reads which are compared once within 50bp of the tail end of the chromosome, calculating the times of occurrence of telomere repetitive motifs ('CCCTAA'/'TTTAGGG') in each read, defining the read with the largest occurrence as ref, and the other is query, and reassembling the ref and the query by using the medaka_conconsu to obtain a consistency sequence. And (3) respectively comparing the identical sequences to each chromosome, and replacing the terminal telomere sequences by taking the optimal comparison result. If the identity is below the 80 threshold or the aligned upper region is not 20Kbp at the end of the chromosome, no substitution is made.

8. Supplement gap

The gap_closer flow is used for filling gaps on genome, the window map is used for comparing the hole filling data (without N) with the genome data (with N), the flow is divided into three levels for filling the gaps, and the priorities are as follows: other error corrected genome version > HiFi data > ONT data. The other version of the corrected genome can be nextdenovo, necat, hifiasm, canu … or any other version of genome; the HiFi data is preferably ccs.fasta, namely the HiFi data after cyclizing; the ONT data is preferably cns.fa, i.e. ONT data is subjected to the next denov self-error correction to generate a consistency sequence coherent. Under the condition of insufficient data volume, the selection of the mutants/ont original reads can be considered to fill in holes on the genome. If the position on the alignment can just cross the two ends of the gap, the optimal alignment region of the longest length region on the alignment is selected, and the sequence containing the gap region on the genome is replaced by the complementary gap data. Theoretically, the procedure can supplement the gap as long as the amount of HiFi/ONT data provided for hole supplement is large enough. However, the process is also limited, and is not suitable for a very short area with multiple gaps, so that the length of the gap extending to two ends is not recommended to exceed the shortest distance between two gaps.

HIFI error correction

In the step, the gap filling operation is performed, and in order to ensure that the sequences filled by us are reliable and the consistency of genome quality is ensured, the genome after hole filling is subjected to error correction by using HiFi data. Filtering out the original data with HiFi data below 10Kbp, and comparing the filtered data with the genome of the hole by using a window map to obtain a target. The bam file was filtered with samtools view-F256 (https:// github. Com/samtools/samtools), and the alignment fragments were filtered with falconcba-filter-clip (https:// github. Com/PacificB biosciences/FALCON), and finally error corrected with the lipover branch of racon (https:// github. Com/isovic/racon) to obtain the HiFi corrected genomic sequence.

10. Chromosome ordering and co-linearity

Mapping the assembly result onto the genome of the published species, corresponding the chromosome number and the chromosome direction according to the chromosome number and the chromosome direction of the published genome, and then co-linearizing the adjusted genome with the genome of the published species. Likewise, the genome and ref after position adjustment can also be input into a collinearity flow to be made a mu mmer collinearity diagram or jcvi collinearity diagram, so that a straight line or waterfall type collinearity diagram can be obtained, and the collinearity situation between the two versions can be better seen.

11. Genome final confirmation

And selecting a genome with high integrity and high accuracy from the two lines as a final genome.

Examples

1. Genome assembly was performed using Pacific Bio Inc. (PacBio) high fidelity sequencing data (HiFi) to generate contigs (contigs). Approximately 20Gb Pacific Biotechnology information center (NCBI) rice Pacific Biotechnology Co (PacBIO) high-fidelity sequencing data (HiFi) was used. Assembly was performed using a hifiasm (https:// github. Com/chhyp 123/hifiasm). The assembly results are shown in table 1.

Table 1: initial assembly result table of hifiasm

num_seqs	sum_len	min_len	avg_len	max_len	N50
						141	401950498	9114	2850712.8	45002514	31388662

2. Flye (https:// gilsub.com/senders glass/Flye), nextdenovo (https:// gilsub.com/NextDenovo), necat (https:// gilsub.com/xiaochuale/NECAT) were assembled using three types of genome assembly software, respectively, using about 50Gb of rice ONT ultralong data downloaded from the National Center for Biotechnology Information (NCBI). The assembly results are shown in table 2.

Table 2: ONT data preliminary assembly result table

	num_seqs	sum_len	min_len	avg_len	max_len	N50
							Nextdenovo	23	398481786	110804	17325295.0	44949911	31426128
flye	77	395339674	1472	5134281.5	52039294	24103585
							necat	61	401818451	23811	6587187.7	44884577	24129713

3. For highly heterozygous species, the genome needs to be dehisced using the software Purgejhaplotigs (v1.0.4; https:// gitub. Com/skingan/purgejhaplotigs_multibam)/Purgejdups (v1.2.5; https:// gitub. Com/df guide/purgejdups).

4. Alignment of mitochondria, chloroplasts using the minimap2 (2.17-r 941) (h.li 2018) software, removal of sequences with base alignment exceeding 50%; bacterial contamination was removed by blast refseq library; removal of low-reads supported contigs (Minimap 2 compares ONT reads (> = 40 kbp) to contigs, if more than 50% of the sites on a contig were below 15 a deep, the threshold was adjusted appropriately according to the data volume.

5. Auxiliary assembly using interaction relationship:

1) Contig clustering

And determining the degree of tightness of association between different Contigs in the effective data by using the Hi-C interaction relationship, and clustering the Contigs. For a genome sketch with a karyotype of 2n, the contact sequences of the sketch are clustered into n chromosome groups by means of an aglaic (v0.9.8) (Zhang et al.2019) software by means of an aglaignohierachicalcalalcfusion (bottom-up hierarchical clustering algorithm).

2) Contig sequencing and orientation

The method comprises the steps of sequencing and orienting the contacts inside n chromosome groups by ALLHIC (v0.9.8) (Zhangetal.2019) software, and converting the interaction relationship between the contacts into a specified binary file (namely, hic file) by software 3D-DNA (v 180419) (Dudchenkoal.2017) and juicer (v 1.6) (Nevac.al.2016b). The sequenced and oriented content (assembly) is then manually sequenced and oriented by windows software Juicebox (v1.11.08) (nevac.et al.2016a).

3) Contig redundancy elimination using interaction relationship

If a sequence does not interact at all with a sequence of the same size, but interacts normally with other sequences, then it is highly likely that the sequence will be a hybrid sequence and will need to be removed manually.

The sequenced, oriented, de-redundant Contig sequences were used to obtain the final chromosome-level genomic sequence using 100N-up gaps. Details of the chromosome and its length statistics are shown in Table 3, and the non-interacting or interacting noise-containing Contig is an unhooked fragment (denoted as chrUnn).

4) Chromosome mounting result statistics: after Hi-C assembly and manual adjustment, genomic sequences sharing the sequence length of 397,785,348 bp were mapped to 12 chromosomes at a 99.25% ratio. The detailed statistics of each chromosome and its length are shown in Table 3.

Table 3: chromosome length statistics

Chromosome	Length(bp)	contignumber
			chr1	25,815,008	6
chr2	31,882,477	1
			chr3	38,842,104	2
chr4	32,745,327	13
			chr5	32,075,978	5
chr6	24,932,023	1
			chr7	37,439,256	3
chr8	30,588,819	2
			chr9	27,101,476	2
chr10	31,388,662	1
			chr11	45,002,514	1
chr12	39,971,698	1
			chrUnn	3,021,772	73

5) The software HiCExPLorer (v 3.6) (Joachim et al 2020) was used to plot based on the relationship between contig interaction strength and position, as shown in FIG. 3.

6. Assembly error correction

6.1 third generation error correction:

three generations of error correction were performed using the k-mer anchoring method for the centromere region (note that this error correction approach was only for T2T genome ont scaffold);

1) ONT reads (> = 40 kbp) were aligned to pseudochromosomes using minimap2 (v 2.17-r 941) (H.Li 2018).

2) To avoid excessive correction of centromere sequences, 21-kmer was found to occur once in assembly and 14 to 46 times in Illumina reads, and T2T-Polish was used

The subroutine filter_by_marker_nosplit (https:// github. Com/map/T2T-Polish) marks the aligned bam.

3) The use of medakaconsensus (v1.5.0; parameters: -model r941_prom_ hac _g507

-batch_size200; https:// gitsub.

4) Using a medakstitch (v1.5.0; https:// gitsub.com/nanoporetech/medaka) yields a consistency sequence after three generations of error correction.

Repeating the steps, namely, the T2T genome ONT skeleton edition needs to iterate two rounds of three-generation error correction.

6.2 second generation error correction:

second generation error correction was performed using deepVariant (v1.3.0) (poprinet al.2018) (note that this error correction approach is only for the T2T genome ONT backbone version);

1) Using bwamem (v0.7.17-r 1188) (HengLi 2013), samtools (v 1.9)

(Daneceketal.2021) to obtain a bam file with aligned third-generation error corrected genome and second-generation data.

2) Using deepvariant (v1.3.0; parameters: model_type=wgs) (poprinet al 20

18 A vcf file is obtained.

3) Bcftools (v 1.15; parameters: view-e ' type= "ref '", view-i ' QUAL >1

The vcf file is filtered by & (gt= "AA" ||gt= "AA")') (daneceketal.2021), and left-aligned and normalized with bcftoolsnorm (daneceketal.2021).

4) Compressing vcf files and generating index files.

5) bcftoolsconsensus (daneceketal.2021) derives a consistency sequence after second generation error correction.

7. Telomere extension

1) Winnowmap (v 1.11, parameters: k=15, -MD) (chiragetal.2020) all ontruss are aligned to ref and all r-russ aligned once within 50bp of the chromosome end are collected.

2) The number of occurrences of the telomere repeat motif ('CCCTAAA'/'ttaggg') in each read was calculated (telomere database: http:// telomerase. Asu. Edu/sequences_telomere. Html), and defines the read that occurs the most times as ref, the others as query.

3) The medaka_consingsu (v1.2.1, parameter: -mr941_min_high_g360; https:// gitsub.

com/nanoporetech/medaka), and reassembling ref telomere read and query telomere read to obtain a consensus sequence.

4) The sequence identical to the telomere was aligned to each chromosome using nucmer (v 3.1) (Kurtzetal.2004), and the terminal telomere sequence was replaced with the best alignment. If the identity is below the 80 threshold or the aligned upper region is not 20Kbp at the end of the chromosome, no substitution is made.

Gap filling

For filling up gaps in the genome, with a winnowmap (v 1.11, parameters: k=15, -MD)

(chiragetal 2020) compare hole filling data (without N) to genomic gap intervals (with N), this step is divided into three levels to fill gaps, with priority: other corrected genome version > hifi data > ont data. The other version of the corrected genome can be nextdenovo, necat, hifiasm, canu … or any other version of genome; the hifi data is preferably ccs, fasta, namely the hifi data after cyclizing; ont the data is preferably cns.fa, i.e. ont the data is subjected to the next denov self-error correction to produce the consensus sequence coherent. In the case of insufficient amounts of data, the selection of mutants/ontruses may be considered to fill in the genome. If the position on the alignment can just cross the two ends of the gap, the optimal alignment region of the longest length region on the alignment is selected, and the sequence containing the gap region on the genome is replaced by the complementary gap data. Theoretically, a 0gap genome could be obtained as long as the amount of hifi/ont data provided for hole filling is large enough and other assembled versions of the genome for hole filling are sufficiently large.

HiFi reads error correction

1) Using Winnowmap2 (Chirag et al 2020), the > = 10kbp HiFi reads were compared to the post-padding gap version genome (parameters: k=15 groter-than distict=0.9998

–MD-ax map-pb)；

2) Samtools "view" (v 1.10, parameters: f256) (Danecek et al 2021) to filter the fragments;

3) Deletion of the alignment fragment (-t-F0 x 104) using "falconc bam-filter-clip";

4) Using these filtered alignment information, error correction is performed using a special branch of racon, (v 1.6.0,

-L-u，https://github.com/isovic/racon/tree/liftover)。

T2T genome evaluation

1) The continuity of genome assembly was assessed by the position and number of gaps on the genome, as shown in table 4.

Table 4:0gap genome

2) Using software bwa (version: 0.7.17-r 1188) and comparing (mapping) the quality control reads (clean reads) obtained by the second generation high throughput sequencing (such as the Illumina sequencing platform) to the reference genome. The statistical second-generation alignment and coverage are shown in table 5.

Table 5:

note that: sample name: sample name; mapping rate (%): the comparison rate; paired mapping rate (%): double-end comparison rate; average sequencing depth: average sequencing depth; coverage (%): coverage; coverage at least 4X (%): coverage above 4X depth; coverage at least 10X (%): coverage above 10X depth; coverage at least 20X (%): coverage above 20X depth.

3) Software BUSCO (version: 4.1.4; parameters: -evaluation 1 e-05) to evaluate the integrity of the genome assembly, the results are shown in table 6. At better genome integrity, the Complete BUSCOs (C) ratio in the B USCO evaluation in the T2T genome is typically greater than 95%.

Table 6: integrity assessment results of genome assembly

Note that: complete BUSCOs are Complete BUSCO; complete and single-copy BUSCOs are complete and single copy BUSCOs; complete and duplicated BUSCOs is complete but multicopy BUSCO; fragmented BUSCOs is incomplete BUSCO; the Missing BUSCO is Missing BUSCO; total BUSCO groups searched is the total number of conserved protein genes recorded in BUSCO pool used this time.

4) The results of the kmer statistics through the second generation data and the genome are shown in table 7. The number of the kmer species and the distribution thereof were compared to judge the quality and accuracy of genome assembly.

Table 7: genome assembly QV value statistics

qv	completeness(％)	length	N50	ReadsNum
					45.1739	99.1838	397,733,901	32,070,994	12

Note that: qv is the qv value used to assess genome quality; complexation is used to assess genome integrity; length is the genome length; n50 is N50 of contig grade, namely all contigs are sequentially accumulated after being sequenced from long to short, and when the added length reaches half of the total length of the genome, the last contig length is N50; reads Num is the number of contig-level genome Reads.

5) The results of the telomere sequence identification were based on the telomere region base repeat sequence (CCCTAAA at the 5 'end and ttaggg at the 3' end) and are shown in table 8.

Table 8: telomere repeat statistics

/>

6) Centromere sequence identification

(I)ChIP-Seq

After obtaining centromere ChIP-seq data, screening by comparison with genome sequence to obtain data with single copy on genome, and anchoring the data to genome of corresponding species to obtain ChIP-seq enriched peak diagram indicating centromere region. This method has been used for precise localization of the genome of centromeres.

(II) FISH verification

Fluorescent In Situ Hybridization (FISH) verification combined with cytology eventually allows identification of centromeric region specific repeat sequences.

11. Final T2T version Hi-C interaction heat map

After telomere extension, gap supplementation and error correction, a new chromosome-level Hi-C interaction heat map is obtained through Hi-C data again by using the final T2T version genome, and whether the previous operation is correct or not is identified through the abnormality of interaction signals in the heat map as shown in figure 4.

12. Co-linearity analysis

Comparing the assembled genome with historical genome version of the target species; the analysis was limited to historical genomic versions of species and the results of the colinear analysis are shown in figure 5. The historical genome version is a genome obtained by another T2T assembly method.

The present invention is not limited to the details and embodiments described herein, and further advantages and modifications may readily be achieved by those skilled in the art, so that the present invention is not limited to the specific details, representative solutions and examples described herein, without departing from the spirit and scope of the general concepts defined by the claims and the equivalents.

Claims

1. A high-throughput sequencing-based animal and plant genome T2T assembly method, which is characterized by comprising the following steps:

s7, adjusting chromosomes by combining genome sequencing of known species to obtain assembled genomes;

the telomere extension process in the step S5 is as follows:

2. The method for assembling animal and plant genome T2T according to claim 1, wherein the third generation of error correction in step S4 is: iterative rounds of three-generation error correction were performed on the ONT genomic centromere region using the k-mer anchoring method.

3. The method of claim 2, wherein the ONT genome centromere region is based on second generation data and the frequency of k-mers in the genome, marking 21-kmers that occur once in assembly and 14 to 46 times in second generation data.

4. The method for assembling animal and plant genome T2T according to claim 1, wherein the second generation error correction process in step S4 is: and cutting and comparing the second-generation original data to the genome, performing mutation detection through a deep neural network to obtain mutation information, and filtering and aligning the mutation information to obtain a consistency sequence so as to obtain the genome after second-generation error correction.

5. The method according to claim 1, wherein in the step S52, if the identity is lower than 80 threshold or the alignment upper region is not 20Kbp at the chromosome end, no substitution is performed.

6. The method of claim 1, wherein the genome sequence gap filling step in step S6 is: hole filling is carried out on the genome based on a hole filling program, hole filling data are compared with genome data, and hole filling is respectively carried out according to the sequence of other corrected genome versions, hiFi data and ONT data.

7. The method according to claim 6, wherein in the gap filling step, the genome is filled with the original segment of the mutation/ont in the case of insufficient data; if the position on the alignment can just cross the two ends of the gap, the optimal alignment region of the longest length region on the alignment is selected, and the sequence containing the gap region on the genome is replaced by the complementary gap data.

8. The method of claim 1, wherein the genome HiFi correction step in step S6 is: firstly filtering HiFi data lower than 10kbp, and comparing the filtered data with the genome of the hole to obtain a combined and ordered file; and respectively filtering out the fragments subjected to secondary alignment and chimeric alignment, and performing third-generation error correction.

9. The method for assembling animal and plant genome T2T according to claim 1, further comprising the step of co-linearity analysis: performing colinear alignment analysis on the adjusted genome in the step S7 and the genome of the published species, wherein analysis indexes comprise integrity and accuracy.