CN110600078B

CN110600078B - Method for detecting genome structure variation based on nanopore sequencing

Info

Publication number: CN110600078B
Application number: CN201910786443.0A
Authority: CN
Inventors: 郑洪坤; 王运通; 李绪明; 邓德晶; 梁若冰; 王晶
Original assignee: Beijing Biomarker Technologies Co ltd
Current assignee: Beijing Biomarker Technologies Co ltd
Priority date: 2019-08-23
Filing date: 2019-08-23
Publication date: 2022-03-18
Anticipated expiration: 2039-08-23
Also published as: CN110600078A

Abstract

The invention provides a method for detecting genome structural variation based on nanopore sequencing, comprising: performing quality control on nanopore sequencing data and mapping it to a genome; mining the information of within-alignment reads and split reads; defining and typing structural variations; and integrating variation types across multi-sample data. The method can detect variation between mutant and wild-type materials and between materials with extreme resistance, and can detect transgenic events and locate inserted fragments. It can also integrate results across large sample sizes and assist in correcting variation sites, providing abundant structural variation data for natural resource populations for subsequent GWAS analysis. The method detects common large-fragment structural variations and has high accuracy and precision in detecting chimeric variations; in addition, the reads supporting each typed structural variation are clustered to determine whether the variation is homozygous or heterozygous.

Description

Method for detecting genome structure variation based on nanopore sequencing

Technical Field

The invention belongs to the technical field of bioinformatics, and particularly relates to a method for detecting genome structural variation based on nanopore sequencing, covering quality control of the raw sequencer output data, alignment to a reference genome, structural variation detection, and final integration of large-sample data.

Background

Genomic structural variations (structural variants) generally refer to sequence changes and positional rearrangements of large fragments on a genome. They comprise many variant types, including large insertions or deletions (big indels) of 50bp or more, tandem repeats, chromosomal inversions, chromosomal translocations, copy number variations (CNVs), and more complex chimeric variations. Compared with SNPs (single nucleotide polymorphisms), structural variations account for a larger proportion of variant bases, have a larger influence on the genome, and often have a great impact on the organism once they occur. In humans, some rare, shared structural variations are correlated with diseases or even directly cause them; autism, obesity, schizophrenia, and cancer, for example, are all related to structural variations. In plants, structural variations are associated with many phenotypic variations and with biotic and abiotic stresses, and are therefore an increasingly important field of research.

Most high-throughput sequencing technologies sequence short DNA fragments (generally 150-300bp); they have difficulty resolving large structural variations, and their accuracy and precision are relatively low. Therefore, there is a need in the art for a method that can accurately and precisely detect genomic structural variations by sequencing long-fragment DNA.

Disclosure of Invention

The invention aims to remedy the lack, in the prior art, of a complete method for detecting structural variation in long-fragment DNA based on nanopore sequencing. It provides a method that detects large-fragment structural variation and chimeric structural variation based on read alignment information (within-alignment reads and split reads), and that clusters structural variations using read support information to determine heterozygosity and homozygosity.

The invention provides a method for detecting genome structure variation based on nanopore sequencing, which comprises the following steps:

(1) performing nanopore sequencing data quality control, and performing genome mapping;

(2) mining information of within-alignment reads and split reads based on the comparison result in the step (1);

(3) defining structural variation and typing based on reads support information;

(4) multi-sample data variant type integration.

The flowchart of the method for detecting the genome structural variation based on the nanopore sequencing is shown in figure 1.

In the method, the quality control of the nanopore sequencing data and the genome mapping in step (1) comprise the following steps:

a. filtering out sequencing adapter sequences, reads with a quality value less than 7, and reads shorter than 500bp;

b. mapping the clean data after quality control to a reference genome, and computing statistics on alignment efficiency, whole-genome read coverage, and sequencing depth;

c. evaluating the overall alignment quality of the nanopore sequencing data based on the information in step b to judge whether the third-generation data is abnormal, and proceeding to downstream analysis if the alignment efficiency is higher than 70% and the whole-genome coverage is higher than 50%.
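The thresholds in steps a to c can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Read` record, the function names, and the way alignment efficiency and coverage are passed in are all assumptions made for the example; only the numeric thresholds (quality 7, length 500bp, efficiency 70%, coverage 50%) come from the text.

```python
from dataclasses import dataclass

MIN_QUALITY = 7         # reads with a quality value below this are dropped
MIN_LENGTH = 500        # reads shorter than this (bp) are dropped
MIN_ALIGN_RATE = 0.70   # alignment efficiency required for downstream analysis
MIN_COVERAGE = 0.50     # whole-genome coverage required for downstream analysis


@dataclass
class Read:
    """Hypothetical read record; adapter trimming is assumed to be done already."""
    name: str
    length: int
    mean_quality: float


def filter_reads(reads):
    """Keep reads that pass both the quality and the length threshold."""
    return [r for r in reads
            if r.mean_quality >= MIN_QUALITY and r.length >= MIN_LENGTH]


def passes_alignment_qc(mapped_reads, total_reads, covered_bases, genome_size):
    """Return True if both alignment efficiency and genome coverage exceed the thresholds."""
    if total_reads == 0 or genome_size == 0:
        return False
    align_rate = mapped_reads / total_reads
    coverage = covered_bases / genome_size
    return align_rate > MIN_ALIGN_RATE and coverage > MIN_COVERAGE
```

Both checks use strict inequalities because the text asks for values "higher than" the thresholds.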

In step (1), adapters, low-quality reads, and short reads are filtered from the raw output data of the nanopore sequencer. The length distribution of the clean data is then computed (see FIG. 2), where the abscissa represents the read length (bp); the left ordinate represents the number of reads, corresponding to the histogram indicated by arrow 1; the right ordinate represents the total number of bases (Mb) contained in reads longer than the corresponding length, corresponding to the curve indicated by arrow 2; and the dotted line indicated by arrow 3 marks the N50 length of the reads. The clean data is then aligned to the reference genome. The percentage of genome bases covered by reads is called the genome coverage (see FIG. 3), which is mainly affected by the sequencing depth and by how closely the sample is related to the reference genome. The number of reads covering a base is its coverage depth (see FIG. 4). The coverage depth of the genome affects the accuracy of variation detection: the higher the coverage depth of a region, the more accurate the variation detection. The quantity and quality of the data for downstream analysis are confirmed by evaluating the filtered long-read length distribution, the genome coverage, and the coverage depth.
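The three summary quantities named above (read N50, genome coverage, coverage depth) can be computed as in the following sketch. The function names and the per-base depth list are assumptions for illustration; in practice the depths would come from the alignment file rather than from an in-memory list.

```python
def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def coverage_and_mean_depth(depth_per_base):
    """Return (fraction of bases covered by at least one read, mean depth)."""
    n = len(depth_per_base)
    if n == 0:
        return 0.0, 0.0
    covered = sum(1 for d in depth_per_base if d > 0)
    return covered / n, sum(depth_per_base) / n
```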

In the method for detecting genome structure variation based on nanopore sequencing, the information of within-alignment reads and split reads is mined in the step (2), and the method comprises the following steps:

a. performing parameter initialization based on the long reads aligned in step (1): randomly extracting 1000 reads, evaluating the overall read error rate and the read alignment scores, computing the distances between alignment differences (mismatches, indels) in a 100bp sliding window along the reads, and computing the ratio of the primary to the suboptimal alignment score for each read;

b. evaluating the reliability of each read according to its alignment score and mapping quality, together with the parameters from step a, and discarding alignments that meet any of the following conditions: mapping quality less than 20; ratio of primary to suboptimal alignment score less than 2; alignment score below the minimum alignment score (the threshold comes from the parameter initialization); or the read is soft-clipped more than 7 times;

c. recording the within-alignment reads and split reads information for the reads remaining after filtering, according to the alignment information.

In step (2), parameter initialization is performed on the 1000 randomly extracted reads according to alignment score and alignment quality. Reads with mapping quality less than 20, a primary/suboptimal score ratio less than 2, an alignment score below the minimum score from the parameter initialization, or more than 7 soft clips are discarded; the remaining alignments are defined as the high-quality result. The within-alignment reads and split reads are then recorded from the alignment information as the data for subsequent structural variation calling.
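The four discard rules can be expressed as a single predicate, sketched below. The `Alignment` record and its field names are hypothetical, and `min_score` stands in for the threshold produced by the parameter initialization, whose exact derivation the text does not specify. Treating a read with no suboptimal alignment as passing the ratio test is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical per-read alignment summary."""
    mapq: int                # mapping quality
    primary_score: float     # alignment score of the primary alignment
    secondary_score: float   # best suboptimal score; 0 if there is none (assumption)
    soft_clips: int          # number of times the read is soft-clipped


def keep_alignment(aln, min_score):
    """Return True if the alignment survives all four filters."""
    if aln.mapq < 20:
        return False
    # Ratio test only applies when a suboptimal alignment exists.
    if aln.secondary_score > 0 and aln.primary_score / aln.secondary_score < 2:
        return False
    if aln.primary_score < min_score:
        return False
    if aln.soft_clips > 7:
        return False
    return True
```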

In the method of the present invention, the step (3) of defining the structural variation type and the typing based on the reads support information includes the steps of:

a. based on the within-alignment reads information from step (2), recording the MD and CIGAR information of each read;

b. based on the split reads information from step (2), in which each read is soft-clipped into multiple segments according to the alignment information, marking the longest segment as the primary alignment and the remaining segments as suboptimal and other alignments, and recording coordinate information for all aligned segments;

c. merging the potential structural variations detected in steps a and b while counting the structural variations on each read: structural variations of the same type whose positions lie within 1000bp of each other are merged and the number of supporting reads is recorded; if the number of supporting reads is less than the user-defined threshold of 10, the structural variation is either discarded or marked as a potential structural variation that needs verification by more reads; the remainder is the final structural variation information.
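The merging rule in step c can be sketched as a simple clustering of per-read calls. This is one possible reading rather than the patented procedure: the input tuple layout, the chaining of each call against the previous one in sorted order, and the use of the median as the merged coordinate are assumptions; the 1000bp distance and the support threshold of 10 come from the text. Clusters below the threshold are dropped here, although the text also allows flagging them as potential variations.

```python
from collections import defaultdict
from statistics import median_low


def merge_calls(calls, max_distance=1000, min_support=10):
    """Merge per-read calls (chrom, sv_type, start, end) into supported variations."""
    groups = defaultdict(list)
    for chrom, sv_type, start, end in calls:
        groups[(chrom, sv_type)].append((start, end))

    merged = []
    for (chrom, sv_type), spans in groups.items():
        spans.sort()
        clusters = [[spans[0]]]
        for span in spans[1:]:
            last = clusters[-1][-1]
            # Same variation if both start and end lie within max_distance.
            if (abs(span[0] - last[0]) < max_distance
                    and abs(span[1] - last[1]) < max_distance):
                clusters[-1].append(span)
            else:
                clusters.append([span])
        for cluster in clusters:
            if len(cluster) >= min_support:
                merged.append({
                    "chrom": chrom,
                    "type": sv_type,
                    "start": median_low([s for s, _ in cluster]),
                    "end": median_low([e for _, e in cluster]),
                    "support": len(cluster),
                })
    return merged
```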

In step a of step (3), the aligned genome position and the mismatch and indel information are first extracted; then any indel longer than the user-defined 50bp is judged to be a potential structural variation; then the aligned segments of the reads are scanned using a plane-sweep algorithm and the coordinates of potential structural variations are recorded; finally, potential insertions, deletions, and noise regions are identified from the within-alignment reads information. If a read contains more than 3 noise regions, the read is discarded together with the structural variation records on it. Noise regions are regions of mismatches and small indels.

Step b further comprises classifying the reads into two types: (1) reads soft-clipped twice, which mainly cover simple structural variations, and (2) reads soft-clipped more than twice, which mainly cover complex structural variation regions.
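Extracting large indels from a CIGAR string can be sketched as below. The sketch only walks the CIGAR operations and reports insertions and deletions longer than 50bp; it does not implement the plane-sweep scan or the noise-region rule, and the function name, the 0-based coordinates, and the output tuple layout are assumptions for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
MIN_SV_LEN = 50  # indels longer than this are potential structural variations


def large_indels(ref_start, cigar, min_len=MIN_SV_LEN):
    """Return (type, ref_start, ref_end, length) for each indel longer than min_len."""
    calls = []
    ref_pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "I":
            # Insertions consume read bases only; the reference position stays put.
            if length > min_len:
                calls.append(("INS", ref_pos, ref_pos, length))
        elif op == "D":
            if length > min_len:
                calls.append(("DEL", ref_pos, ref_pos + length, length))
            ref_pos += length
        elif op in ("M", "N", "=", "X"):
            ref_pos += length
        # S, H and P consume no reference bases.
    return calls
```

For example, `large_indels(100, "200M60I100M10D50M120D30M")` reports the 60bp insertion and the 120bp deletion and skips the 10bp deletion.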

In step b, for type (1), if the two soft-clipped segments of a read align to the same chromosome in the same orientation, the variation is judged from the recorded read coordinates and genome coordinates: Distance_reads - Distance_genome > 50bp is defined as an insertion (see FIG. 5); Distance_genome - Distance_reads > 50bp is defined as a deletion (see FIG. 6); Distance_genome > 50bp together with Distance_reads < 50bp is defined as a repeat (duplication) (see FIG. 7);

if the two soft-clipped segments align to the same chromosome in different orientations, the variation is judged to be an inversion; and if the two segments align to different chromosomes, it is judged to be an interchromosomal translocation.
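The type (1) decision rules can be transcribed literally as in the sketch below. The text does not define how Distance_reads and Distance_genome are measured or how overlapping conditions are resolved, so this sketch assumes they are the gaps between the end of the first segment and the start of the second, on the read and on the genome respectively, and it tests the rules in the order they are listed. The `Segment` record and the short labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """Hypothetical aligned segment of a split read."""
    chrom: str
    strand: str      # "+" or "-"
    read_start: int
    read_end: int
    ref_start: int
    ref_end: int


def classify_pair(first: Segment, second: Segment, min_len: int = 50) -> Optional[str]:
    """Classify the variation implied by two consecutive segments of one read."""
    if first.chrom != second.chrom:
        return "TRA"   # interchromosomal translocation
    if first.strand != second.strand:
        return "INV"   # inversion
    # Assumed definitions: gap between the two segments on the read and on the genome.
    distance_reads = second.read_start - first.read_end
    distance_genome = second.ref_start - first.ref_end
    if distance_reads - distance_genome > min_len:
        return "INS"
    if distance_genome - distance_reads > min_len:
        return "DEL"
    if distance_genome > min_len and distance_reads < min_len:
        return "DUP"
    return None
```

Because the deletion and repeat conditions can both hold for the same pair, the order of the checks affects the result; a different distance definition would change which label is returned.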

For type (2), if the read contains a short segment whose overlap is greater than 200bp and more than 40%, and the coordinate directions of the two segments are inconsistent, the variation is judged to be an inverted duplication.

In the method, the multi-sample variation type integration in step (4) performs structural variation clustering based on the merged read information from step (3): a structural variation is judged homozygous when its allele frequency is greater than 0.8 and heterozygous when its allele frequency is within 0.3-0.8. Based on the genome position and variation type of the structural variations in each sample, the data of all samples are merged, and each site is marked as "presence" or "absence" in each sample.
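The zygosity thresholds and the multi-sample merge can be sketched as follows. The text does not say how the allele frequency is computed; this sketch assumes it is the number of reads supporting the variation divided by the total number of reads at the locus. Matching sites across samples by exact (chromosome, position, type) keys is also a simplifying assumption, as are the function names.

```python
def genotype(supporting_reads, total_reads):
    """Return 'homozygous', 'heterozygous', or None from the allele frequency."""
    if total_reads == 0:
        return None
    allele_frequency = supporting_reads / total_reads  # assumed definition
    if allele_frequency > 0.8:
        return "homozygous"
    if allele_frequency >= 0.3:
        return "heterozygous"
    return None


def presence_matrix(sample_calls):
    """Map each site to a per-sample 'presence'/'absence' label.

    sample_calls: {sample_name: set of (chrom, position, sv_type)}
    """
    sites = sorted(set().union(*sample_calls.values()))
    return {
        site: {
            sample: "presence" if site in calls else "absence"
            for sample, calls in sample_calls.items()
        }
        for site in sites
    }
```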

The invention provides one or more of the following applications of the method for detecting the genome structural variation based on the nanopore sequencing,

(1) detecting variations in the mutant and wild-type material;

(2) detecting variations between extremely resistant materials;

(3) detecting a transgenic event and searching the position of an insert;

(4) organically integrating results aiming at a large sample size, and assisting in correcting the variation sites;

(5) providing abundant structural variation data for a natural resource group for GWAS analysis;

(6) detecting structural variation and/or chimeric variation of the large fragment;

(7) clustering is carried out aiming at reads where the typed structural variation is located, and homozygosity and heterozygosity of the structural variation are judged.

The invention has the following beneficial effects: (1) parameter evaluation and filtering of low-quality alignment information; compared with second-generation data, the accuracy is estimated to be about 90% and the precision more than 80%; (2) detection of structural variations of small, medium, and large (>10kb) fragments; (3) detection of structural variations in complex regions of the genome; (4) automated processing of structural variations for large sample sizes.

The method is suitable for individual resequencing: mining mutations and structural variations between mutant and wild-type materials, detecting transgenic events and locating inserts, and detecting variation between materials with extreme resistance. It is also suitable for population resequencing, where it can be used to study the role of structural variation in population genetic evolution and in selection and domestication, and to perform downstream structural variation GWAS analysis. The invention adopts nanopore sequencing and exploits the long reads of third-generation sequencing, which can span entire structural variations and/or repeat regions; by detecting both within-alignment reads and split reads, it obtains more accurate and precise structural variants, deepening the understanding of structural variations and of their effects on disease, evolution, and genetic diversity.

Drawings

FIG. 1 is a flow chart of the method for detecting genomic structural variation based on nanopore sequencing according to the present invention.

FIG. 2 is a distribution diagram of reads length. The abscissa represents the length distribution (bp) of reads, and the ordinate on the left represents the number of reads, corresponding to the histogram indicated by the arrow 1; the ordinate on the right represents the total number of bases (Mb) contained in reads greater than the corresponding length, corresponding to the curve indicated by the arrow 2; the dotted line indicated by arrow 3 indicates the length of N50 for reads.

FIG. 3 is a map of genome coverage. The abscissa is the chromosome position and the ordinate is the value obtained by taking the logarithm (log2) of the coverage depth of the corresponding position on the chromosome.

FIG. 4 is a distribution diagram of read coverage depth, reflecting the basic distribution of sequencing depth. The abscissa is the sequencing depth; the left ordinate is the percentage of bases at that depth, corresponding to the curve indicated by arrow 1; the right ordinate is the percentage of bases at or below that depth, corresponding to the curve indicated by arrow 2.

FIG. 5 shows the insertion type. The region marked by the arrow is the detected inserted fragment: compared with the reference genome, essentially all third-generation long reads carry an inserted sequence at the position indicated by the arrow. The figure shows that essentially all reads support the same position in the reference genome, so the quality of the detection result is high.

FIG. 6 shows the deletion type. The blank region in the figure is the detected deleted fragment: all long reads aligned to the region carry the deletion at this genomic position, and their start and stop breakpoints are substantially the same, so the quality of the detection result is high.

FIG. 7 shows the inversion type. The inverted region is the detected inversion-type structural variation; inversion means that the long reads are inverted in a certain region of the genome. The figure shows that in this region all long reads are inverted and their start and stop breakpoints are substantially the same, so the quality of the detection result is high.

FIG. 8 is a schematic diagram of detection of exogenous insert, wherein insert is exogenous insert, consensus is consensus sequence constructed by assembly of detected reads, and Genome refers to the position of Genome.

FIG. 9 is a schematic diagram of detection of exogenous insert, where insertion is the exogenous insert, consensus is the consensus sequence constructed by assembly of detected reads, and Genome refers to the position of Genome.

Detailed Description

The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

Example 1 method for detecting structural variation based on nanopore sequencing applied to detection of rice individual re-sequencing structural variation

The rice individual re-sequencing structure variation detection method comprises the following steps:

1. Quality control is performed on the nanopore sequencer data, mainly by filtering out adapter sequences, reads with a quality value lower than 7, and reads shorter than 500bp. Clean data statistics are computed from the filtered data, with the following results:

TABLE 1 Clean data statistics

#DataType	SeqNum	SumBase	N50Len	N90Len	MeanLen	MaxLen	MeanQual
Clean data	2001147	1300993204	21490	4007	10081	211522	9.33

Note: DataType: data type; SeqNum: number of sequences; SumBase: total number of bases; N50Len: N50 length of the data; N90Len: N90 length of the data; MeanLen: average read length; MaxLen: longest read length; MeanQual: average read quality value.

2. The clean data is aligned to the reference genome using the alignment software minimap to obtain an alignment file rice. From the alignment result, 1000 aligned reads are randomly extracted to evaluate the overall read error rate, alignment score, alignment quality, and primary/suboptimal score ratio, and to compute the distances between alignment differences (mismatches, indels) in a 100bp sliding window. Parameters are initialized from these statistics to obtain an overall score threshold, and alignments scoring below it are discarded to ensure high alignment quality. Reads with mapping quality less than 20, a primary/suboptimal ratio less than 2, an alignment score below the minimum score from the parameter initialization, or more than 7 soft clips are discarded; the remaining alignments are defined as the high-quality result.

3. Based on the high-quality alignments, the within-alignment reads and split reads information is recorded. For the within-alignment type, the MD tag and CIGAR string of each read are recorded, and the genome position and the lengths of all mismatches and indels are extracted from them; indels over 50bp (default) are defined as potential structural variations. The aligned segments of the reads are then scanned using the plane-sweep algorithm, and the coordinates of potential structural variations are recorded. If a scanned read contains more than 3 noise regions (mismatch and small indel regions), the read is discarded together with the structural variation records on it. For the split-reads type, the reads are divided into two types according to the alignment result: (1) reads soft-clipped twice and (2) reads soft-clipped more than twice. Reads soft-clipped more than 7 times are judged to be low quality and discarded. For type (1), if the two soft-clipped segments of a read align to the same chromosome in the same orientation, the variation is judged from the recorded read coordinates and genome coordinates: Distance_reads - Distance_genome > 50bp is defined as an insertion; Distance_genome - Distance_reads > 50bp is defined as a deletion; Distance_genome > 50bp together with Distance_reads < 50bp is defined as a repeat (duplication). If the two segments align to the same chromosome in different orientations, the variation is judged to be an inversion; if they align to different chromosomes, it is judged to be an interchromosomal translocation. For type (2), if the read contains a short segment whose overlap is greater than 200bp and 40%, and the coordinate directions of the two segments are inconsistent, the variation is judged to be an inverted duplication.

4. The structural variations detected from the within-alignment reads and split reads information are integrated together with their numbers of supporting reads. Structural variations of the same type whose coordinates (start site and stop site) are less than 1000bp apart are judged to be one structural variation. Detected structural variations supported by fewer than 10 reads are discarded (not integrated into the final structural variation detection result).

5. The reads of all detected structural variations are clustered and the structural variations are typed as heterozygous or homozygous: a variation is judged homozygous when its allele frequency is greater than 0.8 and heterozygous when it is within 0.3-0.8. Large structural variation events on the genome, such as insertions, deletions, duplications, inversions, translocations, and chimeric structural variations, are detected. The numbers of structural variations detected (Table 2) and their length distribution (Table 3) are as follows:

TABLE 2 structural variation detection numbers

Tot	DEL	DUP	INV	INS	TRA
4725	580	84	123	3225	713

Note: Tot: total number of SVs; DEL: deletion; DUP: duplication; INV: inversion; INS: insertion; TRA: interchromosomal translocation.

TABLE 3 structural variation detection Length distribution statistics

Len	DEL	DUP	INV	INS
0-50bp	129	0	0	448
50-100bp	135	1	0	1474
100-1000bp	209	11	8	1225
1000-10000bp	44	17	21	77
10000+bp	63	55	94	1

Example 2 method for detecting structural variation based on nanopore sequencing applied to rice re-sequencing transgenic event and insert fragment search

The rice re-sequencing transgenic event and insert searching method comprises the following steps:

1. Quality control is performed on the nanopore sequencer data, mainly by filtering out adapter sequences, reads with a quality value lower than 7, and reads shorter than 500bp. Clean data statistics are computed from the filtered data, with the following results:

TABLE 4 Clean data statistics

#DataType	SeqNum	SumBase	N50Len	N90Len	MeanLen	MaxLen	MeanQual
Clean data	2826773	2057904109	28822	14733	22062	207257	8.32

2. The vector sequence is evaluated as a whole, recording the full length of the insert (full), the promoter region (Promoter), and the two key sites on the vector, lyz and the NOS region. The results are as follows:

TABLE 5 evaluation of vector sequences

Region	start	end	Length(bp)
full	1	11457	11457
Promoter	331	1498	1168
lyz	1571	1963	393
NOS	1964	2238	275

3. Homology evaluation: the vector sequence is compared with the rice genome by BLAST to check the specificity of the exogenous insert. The results are as follows:

TABLE 6 homology evaluation

Query	subject	identity	length	mismatches	gap	q.start	q.end	s.start	s.end
full	chr1	99.84	1236	0	1	337	1570	33831450	33832685
full	chr10	85.16	701	71	14	337	1029	13857729	13857054
full	chr10	90.21	337	19	9	1245	1570	13857032	13856699
full	chr11	100	92	0	0	2240	2331	12132154	12132063
full	chr11	100	92	0	0	2240	2331	30442831	30442740
full	chr11	100	86	0	0	245	330	12132063	12132148
full	chr11	100	86	0	0	245	330	30442740	30442825
full	chr3	78.71	249	44	7	1324	1570	18500103	18500344

Note: Query: vector sequence; subject: rice genome sequence; identity: percent identity; length: alignment length; mismatches: number of mismatches; gap: number of alignment gaps; q.start: start position of the alignment on the vector sequence; q.end: end position of the alignment on the vector sequence; s.start: start position of the alignment on the rice genome sequence; s.end: end position of the alignment on the rice genome sequence.

As can be seen from Table 6, the regions of homology between the vector sequence and the rice genome are concentrated in the Promoter region. These regions are masked on the genome.
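One way to check this observation is to intersect the query intervals of the BLAST hits in Table 6 with the Promoter coordinates in Table 5. The sketch below does that; the interval values are copied from the two tables, while the function names and the choice of "fraction of aligned query bases inside the Promoter" as the summary are assumptions for illustration.

```python
PROMOTER = (331, 1498)  # Promoter start and end on the vector (Table 5)

# (subject chromosome, q.start, q.end) for each hit in Table 6
HITS = [
    ("chr1", 337, 1570),
    ("chr10", 337, 1029),
    ("chr10", 1245, 1570),
    ("chr11", 2240, 2331),
    ("chr11", 2240, 2331),
    ("chr11", 245, 330),
    ("chr11", 245, 330),
    ("chr3", 1324, 1570),
]


def overlap_length(a_start, a_end, b_start, b_end):
    """Number of positions shared by two closed intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def promoter_fraction(hits, promoter=PROMOTER):
    """Fraction of aligned query bases that fall inside the promoter interval."""
    aligned = sum(q_end - q_start + 1 for _, q_start, q_end in hits)
    inside = sum(overlap_length(q_start, q_end, *promoter)
                 for _, q_start, q_end in hits)
    return inside / aligned if aligned else 0.0
```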

4. The clean data is aligned to the rice genome using the alignment software minimap. From the alignment result, 1000 aligned reads are randomly extracted to evaluate the overall read error rate, alignment score, alignment quality, and primary/suboptimal score ratio, and to compute the distances between alignment differences (mismatches, indels) in a 100bp sliding window. Parameters are initialized from these statistics to obtain an overall score threshold, and alignments scoring below it are discarded to ensure high alignment quality.

5. Reads that align to the sequence of the exogenous insert (insertion) are collected, requiring an aligned region longer than 200bp that contains lyz and NOS (in part or in whole). The portions at the two ends of each read that do not align to the insert are extracted and aligned to the genome with the interval set to less than 10kb, and a consensus sequence is constructed from the reads supporting the same insertion position.
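The read selection criterion in this step can be sketched as an interval test against the lyz and NOS coordinates from Table 5. The wording "containing lyz and NOS (in part or in whole)" leaves open whether both regions must be touched; this sketch assumes that overlapping either region is sufficient by default and exposes a hypothetical `require_both` flag for the stricter reading. The function names are likewise illustrative.

```python
LYZ = (1571, 1963)  # lyz start and end on the vector (Table 5)
NOS = (1964, 2238)  # NOS start and end on the vector (Table 5)


def overlaps(start, end, region):
    """True if the closed interval [start, end] shares any position with region."""
    return start <= region[1] and end >= region[0]


def supports_insert(q_start, q_end, min_len=200, require_both=False):
    """Decide whether a read's alignment on the vector qualifies as insert support.

    q_start, q_end: aligned interval on the vector sequence (1-based, closed).
    """
    if q_end - q_start + 1 <= min_len:
        return False
    hits = [overlaps(q_start, q_end, LYZ), overlaps(q_start, q_end, NOS)]
    return all(hits) if require_both else any(hits)
```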

6. Based on the high-quality alignments, the within-alignment reads and split reads information is recorded, and the alignment positions of the constructed consensus sequences on the genome are scanned. According to the alignment result, reads are divided into two types: (1) reads soft-clipped twice and (2) reads soft-clipped more than twice; the first type is mainly considered here. If the two soft-clipped segments of a read align to the region in the same orientation, alignment regions with Distance_reads - Distance_genome > 50bp are identified from the recorded read coordinates and genome coordinates, and the number of supporting reads is counted at the same time. If the number of supporting reads is more than 10, the region is judged to be a potential exogenous insert; inserts less than 1000bp apart are merged; finally, the position coordinates of the insert, the genome coordinates, and the consensus sequence coordinates are recorded. The transgenic event and insert detection results are as follows:

(1) genomic position: chr4: 34575202-. For the middle consensus sequence, the region indicated by arrow 1 indicates the region of the insert aligned, the region indicated by arrow 2 indicates the region of the rice genome aligned, and the remaining regions indicate the regions of the rice genome not aligned with the insert.

(2) Genomic position: chr2:35108656-35115260, constructing consensus sequence with total length of 17,330bp, shown in FIG. 9: in the Insertion sequence, Promoter is between the 1 st and 2 nd vertical lines, lyz is between the 3 rd and 4 th vertical lines, and NOS is between the 4 th and 5 th vertical lines. For the middle consensus sequence, the region indicated by arrow 1 indicates the region of the insert aligned, the region indicated by arrow 2 indicates the region of the rice genome aligned, and the remaining regions indicate the regions of the rice genome not aligned with the insert.

The above examples are only for describing the preferred embodiments of the present invention, and are not intended to limit the scope of the present invention, and various modifications and improvements made to the technical solution of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.

Claims

1. A method for detecting genome structure variation based on nanopore sequencing comprises the following steps:

(3) defining structural variation and typing based on reads support information; the method comprises the following steps:

a. recording MD and CIGAR information for each read based on the within-alignment reads information in step (2),

in step a, the aligned genome position and the mismatch and indel information are first extracted; then any indel longer than the user-defined 50bp is judged to be a potential structural variation; then the aligned segments of the reads are scanned using a plane-sweep algorithm, and the coordinates of potential structural variations are recorded; finally, potential insertions, deletions, and noise regions are identified based on the within-alignment information, and if a read contains more than 3 noise regions, the read is discarded together with the structural variation records on it; the noise regions are regions of mismatches and small indels;

b. based on the split reads information in step (2), in which each read is soft-clipped into multiple segments according to the alignment information, marking the longest segment as the primary alignment and the remaining segments as suboptimal and other alignments, and recording coordinate information for all aligned segments; further comprising classifying the reads into two types: (1) reads soft-clipped twice, which mainly cover simple structural variations, and (2) reads soft-clipped more than twice, which mainly cover complex structural variation regions;

for type (1), if the two soft-clipped segments of a read align to the same chromosome in the same orientation, judging from the recorded read coordinates and genome coordinates: Distance_reads - Distance_genome > 50bp is defined as an insertion; Distance_genome - Distance_reads > 50bp is defined as a deletion; Distance_genome > 50bp together with Distance_reads < 50bp is defined as a repeat (duplication);

if the two soft-clipped segments align to the same chromosome in different orientations, the variation is judged to be an inversion; if the two segments align to different chromosomes, it is judged to be an interchromosomal translocation;

for type (2), if the read contains a short segment whose overlap is greater than 200bp and more than 40%, and the coordinate directions of the two segments are inconsistent, the variation is judged to be an inverted duplication;

c. merging the potential structural variations detected in steps a and b while counting the structural variations on each read: structural variations of the same type whose positions lie within 1000bp of each other are merged and the number of supporting reads is recorded; if the number of supporting reads is less than the user-defined threshold of 10, the structural variation is either discarded or marked as a potential structural variation that needs verification by more reads; the remainder is the final structural variation information;

(4) integrating the variation types of the multi-sample data, wherein structural variation clustering is performed based on the merged read information from step (3), and a structural variation is judged homozygous when its allele frequency is greater than 0.8 and heterozygous when its allele frequency is within 0.3-0.8; based on the genome position and variation type of the structural variations of each sample, the data of all samples are merged, and each site is marked as "presence" or "absence" in each sample.

2. The method of claim 1, wherein the nanopore sequencing data of step (1) is quality controlled and genome mapping is performed, comprising the steps of:

a. filtering the sequencing adaptor sequence, filtering reads with the quality value less than 7 and filtering reads with the length less than 500 bp;

3. The method of claim 1, wherein mining information of within-alignment reads and split reads in step (2) comprises the steps of:

a. performing parameter initialization evaluation based on the long reads information compared in the step (1): randomly extracting 1000 reads, evaluating the overall error rate of the reads and the comparison score of the reads, counting the distance of comparison difference based on a 100bp sliding window of the reads, and counting the ratio of the main reads to the suboptimal comparison reads of each read;

b. evaluating the reliability of each read according to its alignment score and mapping quality, together with the parameters from step a, and discarding alignments that meet any of the following conditions: mapping quality less than 20; ratio of primary to suboptimal alignment score less than 2; alignment score below the minimum alignment score, the threshold coming from the parameter initialization; or the read is soft-clipped more than 7 times;

c. recording the within-alignment reads and split reads information for the reads remaining after filtering, according to the alignment information.

4. Use of the method of any one of claims 1-3 for one or more of the following:

(1) detecting variations in the mutant and wild-type material;

(2) detecting variations between extremely resistant materials;

(3) detecting a transgenic event and searching the position of an insert;