CN110600078B - Method for detecting genome structure variation based on nanopore sequencing - Google Patents

Publication number
CN110600078B
Authority
CN
China
Prior art keywords
reads
variation
information
comparison
genome
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910786443.0A
Other languages
Chinese (zh)
Other versions
CN110600078A (en
Inventor
郑洪坤
王运通
李绪明
邓德晶
梁若冰
王晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Biomarker Technologies Co ltd
Original Assignee
Beijing Biomarker Technologies Co ltd
Application filed by Beijing Biomarker Technologies Co ltd
Priority to CN201910786443.0A
Publication of CN110600078A
Application granted
Publication of CN110600078B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Abstract

The invention provides a method for detecting genomic structural variation based on nanopore sequencing, comprising: performing quality control on nanopore sequencing data and mapping it to the genome; mining information from within-alignment reads and split reads; defining and typing structural variations; and integrating variant types across multi-sample data. The method can detect variation between mutant and wild-type materials and between materials with extreme resistance, and can detect transgenic events and locate inserted fragments. It can also integrate results across large sample sizes and assist in correcting variant sites, and it provides abundant structural variation data for natural resource populations for subsequent GWAS analysis. The method detects common large-fragment structural variations and achieves high accuracy and precision in detecting chimeric variations. In addition, reads carrying a typed structural variation are clustered to determine whether the variation is homozygous or heterozygous.

Description

Method for detecting genome structure variation based on nanopore sequencing
Technical Field
The invention belongs to the technical field of bioinformatics and relates to a method for detecting genomic structural variation based on nanopore sequencing, in particular to quality control of raw nanopore sequencer output, alignment to a reference genome, structural variation detection, and final integration of data across large sample sizes.
Background
Genomic structural variations (structural variants) generally refer to changes in the sequence and positional relationships of large fragments of a genome. They cover many variant types, including large insertions or deletions (big indels) of 50 bp or more, tandem repeats, chromosomal inversions, chromosomal translocations, copy number variations (CNVs), and chimeric variations of more complex form. Compared with SNPs (single nucleotide polymorphisms), structural variations account for a larger share of the variant bases, affect the genome more strongly, and often have a large effect on the organism once they occur. In humans, some structural variations, including rare ones, are associated with diseases or even cause them directly; autism, obesity, schizophrenia, and cancer, for example, are all related to structural variations. In plants, structural variations are associated with many phenotypic variations and with biotic and abiotic stress responses, and they are therefore an increasingly important field of research.
Most high-throughput sequencing technologies sequence short DNA fragments (generally 150-300 bp), which makes large structural variations difficult to resolve and yields relatively low accuracy and precision. There is therefore a need in the art for a method that detects genomic structural variations accurately and precisely by sequencing long DNA fragments.
Disclosure of Invention
The invention aims to remedy the lack, in the prior art, of a complete method for detecting structural variation in long-fragment DNA based on nanopore sequencing. It provides a method that detects large-fragment and chimeric structural variations from read alignment information (within-alignment reads and split reads) and clusters structural variations using read support information to determine heterozygosity and homozygosity.
The invention provides a method for detecting genome structure variation based on nanopore sequencing, which comprises the following steps:
(1) performing nanopore sequencing data quality control, and performing genome mapping;
(2) mining information of within-alignment reads and split reads based on the alignment result of step (1);
(3) defining and typing structural variations based on read support information;
(4) integrating variant types across multi-sample data.
The flowchart of the method for detecting the genome structural variation based on the nanopore sequencing is shown in figure 1.
In the method, the nanopore sequencing data quality control and genome mapping of step (1) comprise the following steps: a. filtering out sequencing adapter sequences, reads with a quality value below 7, and reads shorter than 500 bp;
b. mapping the clean data after quality control to a reference genome, and computing the alignment rate, whole-genome read coverage, and sequencing depth;
c. evaluating the overall alignment quality of the nanopore sequencing data based on the statistics from step b to judge whether the third-generation sequencing data are abnormal; downstream analysis proceeds if the alignment rate is higher than 70% and the whole-genome coverage is higher than 50%.
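The thresholds in steps a to c can be sketched in a few lines of Python. This is a minimal illustration only: the `Read` record and the function names are hypothetical, and the alignment rate and genome coverage are assumed to be computed elsewhere and passed in as fractions.

```python
from dataclasses import dataclass

# Thresholds taken from steps a and c of the description.
MIN_MEAN_QUALITY = 7       # reads with a quality value below 7 are filtered
MIN_READ_LENGTH = 500      # reads shorter than 500 bp are filtered
MIN_ALIGNMENT_RATE = 0.70  # alignment rate must be higher than 70%
MIN_GENOME_COVERAGE = 0.50 # whole-genome coverage must be higher than 50%


@dataclass
class Read:
    """Hypothetical minimal read record (not a real library type)."""
    name: str
    length: int
    mean_quality: float


def filter_reads(reads):
    """Keep reads that pass both the quality and the length filter."""
    return [
        r for r in reads
        if r.mean_quality >= MIN_MEAN_QUALITY and r.length >= MIN_READ_LENGTH
    ]


def passes_alignment_qc(alignment_rate, genome_coverage):
    """Gate for downstream analysis (step c); inputs are fractions in [0, 1]."""
    return (alignment_rate > MIN_ALIGNMENT_RATE
            and genome_coverage > MIN_GENOME_COVERAGE)
```

Adapter trimming is not shown; it is normally delegated to a dedicated tool before this kind of filter is applied.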
Step (1) proceeds as follows: adapters, low-quality reads, and short reads are filtered from the raw nanopore sequencer output. The length distribution of the clean data is then tabulated (see FIG. 2), where the abscissa represents read length (bp); the left ordinate represents the number of reads, corresponding to the histogram indicated by arrow 1; the right ordinate represents the total number of bases (Mb) contained in reads longer than the corresponding length, corresponding to the curve indicated by arrow 2; and the dotted line indicated by arrow 3 marks the read N50 length. The clean data are then aligned to the reference genome. The percentage of genome bases covered by reads is called genome coverage (see FIG. 3); it is affected mainly by the sequencing depth and by how closely the sample is related to the reference genome. The number of reads covering a base is its coverage depth (see FIG. 4). Coverage depth affects the accuracy of variant detection: regions with higher coverage depth yield more accurate variant calls. The quantity and quality of data for downstream analysis are confirmed by evaluating the filtered long-read length distribution, genome coverage, and coverage depth.
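The statistics named above (read N50, genome coverage, coverage depth) can be illustrated with a short Python sketch. The function names are hypothetical, and the per-base depth list is an assumed input that a real pipeline would obtain from the alignment file.

```python
def n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def coverage_stats(depths):
    """Given per-base depths, return (genome coverage, mean depth).

    Genome coverage is the fraction of bases covered by at least one read.
    """
    if not depths:
        return 0.0, 0.0
    covered = sum(1 for d in depths if d > 0)
    return covered / len(depths), sum(depths) / len(depths)
```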
In the method for detecting genomic structural variation based on nanopore sequencing, mining the information of within-alignment reads and split reads in step (2) comprises the following steps:
a. performing parameter initialization evaluation based on the long reads aligned in step (1): randomly extracting 1000 reads, evaluating the overall read error rate and the read alignment scores, computing the distances between alignment differences (mismatches, indels) over a 100 bp sliding window along the reads, and computing for each read the ratio of the primary alignment score to the second-best alignment score;
b. evaluating the reliability of each read from its alignment score and mapping quality together with the parameters from step a, and discarding alignments that meet any of the following conditions: mapping quality less than 20; ratio of primary to second-best alignment score less than 2; alignment score less than the minimum alignment score (obtained from the parameter initialization evaluation); read soft-clipped more than 7 times;
c. recording the within-alignment read and split-read information of the reads that remain after filtering, according to the alignment information.
In step (2), parameter initialization evaluation is performed on 1000 randomly extracted reads using their alignment scores and mapping quality. Reads with mapping quality less than 20, a primary/second-best score ratio less than 2, an alignment score below the minimum score from the parameter initialization evaluation, or more than 7 soft clips are discarded; the remaining alignments are defined as the high-quality result. The within-alignment reads and split reads are then recorded from the alignment information as the data for subsequent structural variation calling.
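The four discard conditions can be expressed as a single predicate. The sketch below is illustrative: the `Alignment` record and its field names are hypothetical, `min_score` stands for the minimum score produced by the parameter initialization evaluation, and treating a read with no second-best alignment as passing the ratio test is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical per-read alignment summary (not a real library type)."""
    read_name: str
    mapq: int                 # mapping quality
    score: float              # primary alignment score
    second_best_score: float  # 0 if the read has no second-best alignment
    n_soft_clips: int         # number of times the read was soft-clipped


def score_ratio(aln):
    """Ratio of primary to second-best alignment score."""
    if aln.second_best_score <= 0:
        return float("inf")  # assumption: a unique alignment passes the test
    return aln.score / aln.second_best_score


def is_high_quality(aln, min_score, min_mapq=20, min_ratio=2.0,
                    max_soft_clips=7):
    """True if none of the four discard conditions of step b applies."""
    return (aln.mapq >= min_mapq
            and score_ratio(aln) >= min_ratio
            and aln.score >= min_score
            and aln.n_soft_clips <= max_soft_clips)
```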
In the method of the present invention, defining and typing structural variations based on read support information in step (3) comprises the following steps:
a. based on the within-alignment read information from step (2), recording the MD and CIGAR information of each read;
b. based on the split-read information from step (2), each read may be soft-clipped multiple times according to the alignment information; the longest segment is marked as the primary alignment, the remaining segments are marked as second-best alignments and other alignments, and coordinate information is recorded for all aligned segments;
c. merging the potential structural variations detected in steps a and b while counting the structural variations on each read: structural variations of the same type whose positions lie within 1000 bp of each other are merged and the number of supporting reads is recorded; if the number of supporting reads is below the user-defined threshold of 10, the structural variation is either discarded or flagged as a potential structural variation requiring verification by more reads; the remainder constitute the final structural variation information.
In step a of step (3), the aligned genome position, mismatch, and indel information is first extracted; next, any indel longer than the user-defined 50 bp threshold is judged to be a potential structural variation; the aligned segments of the reads are then scanned with a plane-sweep algorithm and the coordinates of potential structural variations are recorded; finally, potential insertion, deletion, and noise regions are identified from the within-alignment read information. If a read contains more than 3 noise regions, the read is discarded together with the structural variation records on it. Noise regions are regions of mismatches and small indels.
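The merging in step c can be sketched as a greedy clustering of per-read calls. This is an illustration under stated assumptions, not the patented implementation: the tuple layout and function name are hypothetical, both start and end are required to lie within 1000 bp of the cluster's first call, and low-support calls are simply dropped rather than flagged.

```python
from collections import defaultdict


def merge_calls(calls, max_dist=1000, min_support=10):
    """Merge per-read SV calls of the same type at nearby positions.

    calls: iterable of (chrom, sv_type, start, end, read_name).
    Returns a list of (chrom, sv_type, start, end, n_supporting_reads).
    """
    groups = defaultdict(list)
    for chrom, sv_type, start, end, read in calls:
        groups[(chrom, sv_type)].append((start, end, read))

    merged = []
    for (chrom, sv_type), items in sorted(groups.items()):
        items.sort()
        clusters = []
        for start, end, read in items:
            last = clusters[-1] if clusters else None
            # Greedy: only the most recent cluster is considered.
            if (last and abs(start - last["start"]) < max_dist
                    and abs(end - last["end"]) < max_dist):
                last["reads"].add(read)
            else:
                clusters.append({"start": start, "end": end, "reads": {read}})
        for c in clusters:
            if len(c["reads"]) >= min_support:
                merged.append(
                    (chrom, sv_type, c["start"], c["end"], len(c["reads"])))
    return merged
```

Supporting reads are stored in a set so that a read reporting the same event twice is counted once.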
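Extracting large indels from a CIGAR string, as described for the within-alignment reads, might look like the following sketch. The function name and the returned tuple layout are hypothetical; the MD tag, the noise-region count, and the plane-sweep scan are not shown.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def large_indels(ref_start, cigar, min_len=50):
    """Return potential SVs found inside one alignment.

    Each result is (sv_type, ref_start, ref_end, length). Only indels
    longer than min_len (50 bp by default, as in the description) are kept.
    """
    events = []
    ref_pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "I" and length > min_len:
            # An insertion occupies no reference bases.
            events.append(("INS", ref_pos, ref_pos, length))
        elif op == "D" and length > min_len:
            events.append(("DEL", ref_pos, ref_pos + length, length))
        if op in "MDN=X":  # operations that consume the reference
            ref_pos += length
    return events
```

For example, `large_indels(1000, "100M60I50M80D20M5I10M")` reports the 60 bp insertion and the 80 bp deletion and ignores the 5 bp insertion.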
Step b further comprises classifying reads into two types: (1) reads soft-clipped twice, which mainly cover simple structural variations, and (2) reads soft-clipped more than 2 times, which mainly cover regions of complex structural variation.
In step b, for type (1), if the two soft-clipped segments of a read align to the same chromosome in the same orientation, the call is made from the recorded read coordinates and genome coordinates: Distance_reads - Distance_genome > 50 bp is defined as an insertion (see FIG. 5); Distance_genome - Distance_reads > 50 bp is defined as a deletion (see FIG. 6); Distance_genome > 50 bp together with Distance_reads < 50 bp is defined as a duplication;
if the two soft-clipped segments align to the same chromosome in inconsistent orientations, an inversion is called (see FIG. 7); and if the two soft-clipped segments align to different chromosomes, an interchromosomal translocation is called.
For type (2), if segments overlap by more than 200 bp and by more than 40%, and the coordinate orientations of the two segments are inconsistent, an inverted duplication is called.
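The type (1) rules can be sketched as a small classifier over two aligned segments of one read. Several points are assumptions, because the text does not define the distances or the precedence of the rules: distances are taken as signed gaps between the end of the first segment and the start of the second (in read order), a duplication is interpreted as the second segment mapping back more than 50 bp upstream on the reference while the read gap stays under 50 bp, and all coordinates are assumed to be already normalized to one strand. The `Segment` record is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical aligned segment of a split read."""
    chrom: str
    strand: str       # "+" or "-"
    read_start: int
    read_end: int
    ref_start: int
    ref_end: int


def classify_pair(a, b, min_size=50):
    """Classify the SV implied by two segments of the same read."""
    if a.chrom != b.chrom:
        return "TRA"  # interchromosomal translocation
    if a.strand != b.strand:
        return "INV"  # inconsistent orientation
    first, second = sorted((a, b), key=lambda s: s.read_start)
    distance_reads = second.read_start - first.read_end
    distance_genome = second.ref_start - first.ref_end
    # Assumed duplication signature: backward jump on the reference.
    if distance_genome < -min_size and abs(distance_reads) < min_size:
        return "DUP"
    if distance_reads - distance_genome > min_size:
        return "INS"
    if distance_genome - distance_reads > min_size:
        return "DEL"
    return None  # no structural variation implied
```

Checking the duplication signature first keeps it from being absorbed by the insertion rule, which the same coordinates would otherwise satisfy.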
In the method, the multi-sample variant type integration of step (4) clusters structural variations based on the merged read information from step (3): a structural variation is judged homozygous when its allele frequency is greater than 0.8 and heterozygous when its allele frequency is within 0.3-0.8. The structural variations of each sample are recorded by genome position and variant type, the data of all samples are merged, and each site is marked in each sample as either 'presence' or 'absence'.
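A minimal Python sketch of the zygosity call and the multi-sample presence/absence table follows. It assumes that the allele frequency is the fraction of reads at the site that support the variant, which the text does not state explicitly; the behavior below 0.3 is also unspecified, so the sketch returns a separate label there. All names are hypothetical.

```python
def genotype(support_reads, total_reads):
    """Zygosity call from the assumed allele frequency (support / total)."""
    if total_reads == 0:
        return "no_call"
    allele_frequency = support_reads / total_reads
    if allele_frequency > 0.8:
        return "homozygous"
    if allele_frequency >= 0.3:
        return "heterozygous"
    return "unclassified"  # below 0.3: not defined in the description


def integrate(samples):
    """Build a site-by-sample table marked 'presence' or 'absence'.

    samples: dict mapping sample name to a set of sites, where a site is
    (chrom, sv_type, position).
    """
    all_sites = sorted(set().union(*samples.values())) if samples else []
    return {
        site: {
            name: "presence" if site in sites else "absence"
            for name, sites in samples.items()
        }
        for site in all_sites
    }
```

A table of this shape is one convenient input format for downstream association analysis, since each site becomes a binary marker across samples.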
The invention further provides use of the method for detecting genomic structural variation based on nanopore sequencing in one or more of the following applications:
(1) detecting variation between mutant and wild-type materials;
(2) detecting variation between materials with extreme resistance;
(3) detecting transgenic events and locating inserted fragments;
(4) integrating results across large sample sizes and assisting in correcting variant sites;
(5) providing abundant structural variation data for natural resource populations for GWAS analysis;
(6) detecting large-fragment structural variations and/or chimeric variations;
(7) clustering the reads that carry a typed structural variation to determine whether the variation is homozygous or heterozygous.
The beneficial effects of the invention are as follows: (1) the method implements parameter evaluation and filtering of low-quality alignment information; compared with second-generation sequencing data, the accuracy is estimated at about 90% and the precision at more than 80%; (2) structural variations of small, medium, and large (>10 kb) fragments can be detected; (3) structural variations in complex regions of the genome can be detected; (4) structural variations from large sample sizes are processed automatically.
The method is suitable for individual resequencing, for mining variation between mutant and wild-type materials, for detecting transgenic events and locating inserted fragments, and for detecting variation between materials with extreme resistance. It is also suitable for population resequencing, for studying the role of structural variation in population genetic evolution and in domestication and selection, and for downstream structural-variation GWAS analysis. By adopting nanopore sequencing, the invention exploits the long reads of third-generation sequencing, which can span entire structural variations and/or repeat regions, and detects both within-alignment reads and split reads, thereby obtaining more accurate and precise structural variants and deepening the understanding of structural variations and their effects on disease, evolution, and genetic diversity.
Drawings
FIG. 1 is a flow chart of the method for detecting genomic structural variation based on nanopore sequencing according to the present invention.
FIG. 2 is a distribution diagram of reads length. The abscissa represents the length distribution (bp) of reads, and the ordinate on the left represents the number of reads, corresponding to the histogram indicated by the arrow 1; the ordinate on the right represents the total number of bases (Mb) contained in reads greater than the corresponding length, corresponding to the curve indicated by the arrow 2; the dotted line indicated by arrow 3 indicates the length of N50 for reads.
FIG. 3 is a map of genome coverage. The abscissa is the chromosome position and the ordinate is the value obtained by taking the logarithm (log2) of the coverage depth of the corresponding position on the chromosome.
FIG. 4 is a distribution diagram of read coverage depth, reflecting the overall distribution of sequencing depth. The abscissa is the sequencing depth; the left ordinate is the percentage of bases at the corresponding depth, corresponding to the curve indicated by arrow 1; the right ordinate is the percentage of bases at or below the corresponding depth, corresponding to the curve indicated by arrow 2.
FIG. 5 shows the insertion type. The region marked by the arrow is the detected inserted fragment: relative to the reference genome, essentially all third-generation long reads carry an inserted sequence at the position indicated by the arrow. The figure shows that the reads support essentially the same position in the reference genome, so the quality of the detection result is high.
FIG. 6 shows the deletion type. The blank region in the figure is the detected deleted fragment: all long reads aligned to the region have a deletion at this genomic position with essentially the same start and end breakpoints, so the quality of the detection result is high.
FIG. 7 shows the inversion type. The inversion region is the detected inversion-type structural variation, in which long reads align in inverted orientation over a region of the genome. The figure shows that all long reads in this region are inverted with essentially the same start and end breakpoints, so the quality of the detection result is high.
FIG. 8 is a schematic diagram of detection of exogenous insert, wherein insert is exogenous insert, consensus is consensus sequence constructed by assembly of detected reads, and Genome refers to the position of Genome.
FIG. 9 is a schematic diagram of detection of exogenous insert, where insertion is the exogenous insert, consensus is the consensus sequence constructed by assembly of detected reads, and Genome refers to the position of Genome.
Detailed Description
The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
Example 1: Application of the nanopore-sequencing-based structural variation detection method to structural variation detection in individual rice resequencing
Structural variation detection for individual rice resequencing comprises the following steps:
1. Quality control is performed on the nanopore sequencer data, mainly filtering adapter sequences, reads with a quality value below 7, and reads shorter than 500 bp. Clean data statistics computed from the filtered data are as follows:
TABLE 1 Clean data statistics
#DataType SeqNum SumBase N50Len N90Len MeanLen MaxLen MeanQual
Clean data 2001147 1300993204 21490 4007 10081 211522 9.33
Note: DataType: data type; SeqNum: number of sequences; SumBase: total number of bases; N50Len: N50 length of the data; N90Len: N90 length of the data; MeanLen: average read length; MaxLen: longest read length; MeanQual: average read quality value.
2. Clean data are aligned to the reference genome with the alignment software minimap to obtain an alignment file rice. From the alignment result, 1000 aligned reads are randomly extracted and used to evaluate the overall read error rate, alignment score, mapping quality, and primary/second-best score ratio, and to compute the distances between alignment differences (mismatches, indels) over a 100 bp sliding window. Parameters are initialized from these statistics to obtain an overall minimum score, and alignments scoring below it are discarded to ensure high alignment quality. Reads with mapping quality less than 20, a primary/second-best score ratio less than 2, an alignment score below the minimum score from the parameter initialization evaluation, or more than 7 soft clips are discarded; the remaining alignments are defined as the high-quality result.
3. Based on the high-quality alignments, the within-alignment read and split-read information is recorded. For the within-alignment type, the MD tag and CIGAR string of each read are recorded, from which the genome position and the lengths of all mismatches and indels are extracted; indels longer than 50 bp (default) are defined as potential structural variations. The aligned segments of the reads are then scanned with a plane-sweep algorithm, and the coordinates of potential structural variations are recorded. If a scanned read contains more than 3 noise regions (mismatch and small indel regions), the read is discarded together with the structural variation records on it. For the split-read type, reads are divided into two types according to the alignment result: (1) reads soft-clipped twice and (2) reads soft-clipped more than twice. Reads soft-clipped more than 7 times are judged low-quality and discarded. For type (1), if the two soft-clipped segments align to the same chromosome in the same orientation, the call is made from the recorded read coordinates and genome coordinates: Distance_reads - Distance_genome > 50 bp is defined as an insertion; Distance_genome - Distance_reads > 50 bp is defined as a deletion; Distance_genome > 50 bp together with Distance_reads < 50 bp is defined as a duplication. If the two soft-clipped segments align to the same chromosome in inconsistent orientations, an inversion is called; if they align to different chromosomes, an interchromosomal translocation is called. For type (2), if segments overlap by more than 200 bp and by more than 40%, and the coordinate orientations of the two segments are inconsistent, an inverted duplication is called.
4. Structural variations detected from the within-alignment read and split-read information are integrated together with their supporting read counts. Structural variations of the same type whose coordinates (start site and end site) lie within 1000 bp of each other are judged to be one structural variation. Any detected structural variation supported by fewer than 10 reads is discarded (not included in the final structural variation detection result).
5. The reads of all detected structural variations are clustered and the structural variations are typed as heterozygous or homozygous: a structural variation is judged homozygous when its allele frequency is greater than 0.8 and heterozygous when it is within 0.3-0.8. Large structural variation events on the genome, such as insertions, deletions, duplications, inversions, translocations, and chimeric structural variations, are detected. The numbers of detected structural variations (Table 2) and their length distribution (Table 3) are as follows:
TABLE 2 structural variation detection numbers
Tot DEL DUP INV INS TRA
4725 580 84 123 3225 713
Note: Tot: total number of SVs; DEL: deletion; DUP: duplication; INV: inversion; INS: insertion; TRA: interchromosomal translocation.
TABLE 3 structural variation detection Length distribution statistics
Len DEL DUP INV INS
0-50bp 129 0 0 448
50-100bp 135 1 0 1474
100-1000bp 209 11 8 1225
1000-10000bp 44 17 21 77
10000+bp 63 55 94 1
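As a quick consistency check on Tables 2 and 3, the per-length counts can be summed per variant type and compared with the totals. The snippet below only transcribes the two tables; the variable and function names are hypothetical.

```python
# Counts transcribed from TABLE 2 and TABLE 3.
TABLE2 = {"Tot": 4725, "DEL": 580, "DUP": 84, "INV": 123, "INS": 3225, "TRA": 713}
TABLE3 = {
    "0-50bp":       {"DEL": 129, "DUP": 0,  "INV": 0,  "INS": 448},
    "50-100bp":     {"DEL": 135, "DUP": 1,  "INV": 0,  "INS": 1474},
    "100-1000bp":   {"DEL": 209, "DUP": 11, "INV": 8,  "INS": 1225},
    "1000-10000bp": {"DEL": 44,  "DUP": 17, "INV": 21, "INS": 77},
    "10000+bp":     {"DEL": 63,  "DUP": 55, "INV": 94, "INS": 1},
}


def column_totals(table):
    """Sum each variant type over all length bins."""
    totals = {}
    for row in table.values():
        for sv_type, count in row.items():
            totals[sv_type] = totals.get(sv_type, 0) + count
    return totals


def tables_agree():
    """True if Table 3 sums to Table 2 and Table 2 types sum to Tot."""
    per_type = column_totals(TABLE3)
    types_match = all(per_type[t] == TABLE2[t] for t in per_type)
    total_matches = sum(v for k, v in TABLE2.items() if k != "Tot") == TABLE2["Tot"]
    return types_match and total_matches
```

Translocations have no length and so appear only in Table 2.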
Example 2: Application of the nanopore-sequencing-based structural variation detection method to transgenic event detection and insert search in rice resequencing
Transgenic event detection and insert search for rice resequencing comprises the following steps:
1. Quality control is performed on the nanopore sequencer data, mainly filtering adapter sequences, reads with a quality value below 7, and reads shorter than 500 bp. Clean data statistics computed from the filtered data are as follows:
TABLE 4 Clean data statistics
#DataType SeqNum SumBase N50Len N90Len MeanLen MaxLen MeanQual
Clean data 2826773 2057904109 28822 14733 22062 207257 8.32
Note: DataType: data type; SeqNum: number of sequences; SumBase: total number of bases; N50Len: N50 length of the data; N90Len: N90 length of the data; MeanLen: average read length; MaxLen: longest read length; MeanQual: average read quality value.
2. Overall evaluation of the vector sequence: the full length of the insert (full), the promoter region (Promoter), and the two key sites lyz and NOS on the vector are recorded. The results are as follows:
TABLE 5 evaluation of vector sequences
Region start end Length(bp)
full 1 11457 11457
Promoter 331 1498 1168
lyz 1571 1963 393
NOS 1964 2238 275
3. Homology evaluation: the vector sequence is aligned against the rice genome with blast to assess the specificity of the exogenous insert. The results are as follows:
TABLE 6 homology evaluation
Query subject identity length mismatches gap q.start q.end s.start s.end
full chr1 99.84 1236 0 1 337 1570 33831450 33832685
full chr10 85.16 701 71 14 337 1029 13857729 13857054
full chr10 90.21 337 19 9 1245 1570 13857032 13856699
full chr11 100 92 0 0 2240 2331 12132154 12132063
full chr11 100 92 0 0 2240 2331 30442831 30442740
full chr11 100 86 0 0 245 330 12132063 12132148
full chr11 100 86 0 0 245 330 30442740 30442825
full chr3 78.71 249 44 7 1324 1570 18500103 18500344
Note: Query: vector sequence; subject: rice genome sequence; identity: percent identity; length: alignment length; mismatches: number of mismatches; gap: alignment gaps; q.start: start position of the alignment on the vector sequence; q.end: end position of the alignment on the vector sequence; s.start: start position of the alignment on the rice genome sequence; s.end: end position of the alignment on the rice genome sequence.
As can be seen from Table 6, the regions of homology between the vector sequence and the rice genome are concentrated in the Promoter region. These regions are masked on the genome.
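How much each BLAST hit in Table 6 overlaps the vector regions of Table 5 can be checked with simple interval arithmetic. The sketch below transcribes the q.start and q.end columns and the region coordinates, treating all of them as 1-based inclusive (an assumption); the names are hypothetical.

```python
# Vector regions from TABLE 5 (start, end).
REGIONS = {"Promoter": (331, 1498), "lyz": (1571, 1963), "NOS": (1964, 2238)}

# (q.start, q.end) of each hit in TABLE 6, in table order.
HITS = [
    (337, 1570), (337, 1029), (1245, 1570), (2240, 2331),
    (2240, 2331), (245, 330), (245, 330), (1324, 1570),
]


def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of shared bases between two inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def overlaps_by_region():
    """For each vector region, the overlap (bp) of every hit in HITS."""
    return {
        name: [overlap_bp(q_start, q_end, start, end) for q_start, q_end in HITS]
        for name, (start, end) in REGIONS.items()
    }
```

With these coordinates, none of the listed hits overlaps lyz or NOS, which is consistent with using those two sites as the specific markers of the insert.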
4. Clean data are aligned to the rice genome with the alignment software minimap. From the alignment result, 1000 aligned reads are randomly extracted and used to evaluate the overall read error rate, alignment score, mapping quality, and primary/second-best score ratio, and to compute the distances between alignment differences (mismatches, indels) over a 100 bp sliding window. Parameters are initialized from these statistics to obtain an overall minimum score, and alignments scoring below it are discarded to ensure high alignment quality.
5. Reads aligned to the exogenous insert (insertion) sequence are collected, requiring an aligned region longer than 200 bp that contains lyz and NOS (in part or in whole). The portions at both ends of these reads that do not align to the insert are extracted and aligned to the genome with the interval set to less than 10 kb, and a consensus sequence is constructed from the reads supporting the same insertion position.
6. Based on the high-quality alignments, the within-alignment read and split-read information is recorded, and the alignment position of the constructed consensus sequence on the genome is scanned. According to the alignment result, reads are divided into two types: (1) reads soft-clipped twice and (2) reads soft-clipped more than twice; the first type is the main consideration here. If the two soft-clipped segments align to the region in the same orientation, alignment regions with Distance_reads - Distance_genome > 50 bp are identified from the recorded read coordinates and genome coordinates, and the number of supporting reads is counted at the same time. A region supported by more than 10 reads is judged a potential exogenous insert, and regions less than 1000 bp apart are merged. Finally, the position coordinates of the insert, the genome coordinates, and the consensus sequence coordinates are recorded. The transgenic event and insert detection results are as follows:
(1) genomic position: chr4: 34575202-. For the middle consensus sequence, the region indicated by arrow 1 indicates the region of the insert aligned, the region indicated by arrow 2 indicates the region of the rice genome aligned, and the remaining regions indicate the regions of the rice genome not aligned with the insert.
(2) Genomic position: chr2:35108656-35115260, constructing consensus sequence with total length of 17,330bp, shown in FIG. 9: in the Insertion sequence, Promoter is between the 1 st and 2 nd vertical lines, lyz is between the 3 rd and 4 th vertical lines, and NOS is between the 4 th and 5 th vertical lines. For the middle consensus sequence, the region indicated by arrow 1 indicates the region of the insert aligned, the region indicated by arrow 2 indicates the region of the rice genome aligned, and the remaining regions indicate the regions of the rice genome not aligned with the insert.
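The read selection and site grouping of step 5 can be sketched as follows. This is illustrative only: requiring overlap with both lyz and NOS is one reading of "containing lyz and NOS (in part or in whole)", the 10 kb interval is applied as a maximum gap between consecutive flank positions, and all function names are hypothetical. Consensus construction is not shown.

```python
# Key sites on the vector, from TABLE 5 (1-based, inclusive).
LYZ = (1571, 1963)
NOS = (1964, 2238)


def overlap(a_start, a_end, b_start, b_end):
    """Number of shared bases between two inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def is_insert_read(vec_start, vec_end, min_len=200):
    """True if a read's alignment to the vector is longer than min_len
    and touches both lyz and NOS (assumed interpretation)."""
    aligned_len = vec_end - vec_start + 1
    hits_lyz = overlap(vec_start, vec_end, *LYZ) > 0
    hits_nos = overlap(vec_start, vec_end, *NOS) > 0
    return aligned_len > min_len and hits_lyz and hits_nos


def cluster_flank_sites(sites, max_gap=10_000):
    """Group genomic flank positions that lie within max_gap of each other.

    sites: iterable of (chrom, position).
    Returns a list of (chrom, first_pos, last_pos, n_reads).
    """
    clusters = []
    for chrom, pos in sorted(sites):
        last = clusters[-1] if clusters else None
        if last and last[0] == chrom and pos - last[2] < max_gap:
            last[2] = pos      # extend the current cluster
            last[3] += 1
        else:
            clusters.append([chrom, pos, pos, 1])
    return [tuple(c) for c in clusters]
```

Each resulting cluster corresponds to one candidate insertion site whose supporting reads would then be assembled into a consensus sequence.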
The above examples are only for describing the preferred embodiments of the present invention, and are not intended to limit the scope of the present invention, and various modifications and improvements made to the technical solution of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.

Claims (4)

1. A method for detecting genome structural variation based on nanopore sequencing, comprising the following steps:
(1) performing quality control on the nanopore sequencing data and mapping the data to the genome;
(2) mining the within-alignment and split-read information based on the alignment results of step (1);
(3) defining and typing structural variations based on read support information, comprising the following steps:
a. recording the MD and CIGAR information of each read based on the within-alignment information of step (2),
wherein in step a, the aligned genome position, mismatch and indel information are first extracted; then, among all indels, any indel larger than the user-defined 50 bp is judged to be a potential structural variation; then the aligned segments of the reads are scanned with a plane-sweep algorithm and the coordinates of potential structural variations are recorded; finally, potential insertions, deletions and noise regions are identified from the within-alignment information, and if a read contains more than 3 noise regions, the read is discarded together with the structural variation records on it; a noise region is a region of mismatches and small indels;
b. based on the split-read information of step (2), each read is soft-clipped multiple times according to its alignment information, the longest segment is marked as the primary read and the remaining segments as suboptimal-alignment reads and other reads, and coordinate information is recorded for all aligned segments; further, reads are classified into two types: (1) reads soft-clipped twice, which mainly cover simple structural variations, and (2) reads soft-clipped more than twice, which mainly cover complex structural variation regions;
for type (1), if the two soft-clipped segments of a read align to the same chromosome with consistent coordinate directions, the following judgment is made from the recorded read coordinates and genome coordinates: Distance_reads - Distance_genome > 50 bp is defined as an insertion; Distance_genome - Distance_reads > 50 bp is defined as a deletion; Distance_genome > 50 bp together with Distance_reads < 50 bp is defined as a duplication;
if the two soft-clipped segments align to the same chromosome with inconsistent coordinate directions, an inversion is called; if the two soft-clipped segments align to different chromosomes, an inter-chromosomal translocation is called;
for type (2), if the segments include short segments whose overlap is greater than 200 bp and more than 40%, and the coordinates of the two segments are inconsistent, an inverted duplication is called;
c. merging the potential structural variations detected in step a and step b and counting the structural variations of each read; structural variations of the same type at the same position, less than 1000 bp apart, are merged and the number of supporting reads is recorded; if the number of supporting reads is less than the user-defined 10, the structural variation is either discarded or judged to be a potential structural variation that requires verification by more reads; the final structural variation information is thereby obtained;
(4) integrating the variation types of multi-sample data: structural variations are clustered based on the merged read information of step (3), and a structural variation is judged to be homozygous when the allele frequency is greater than 0.8 and heterozygous when it is within 0.3-0.8; based on the genomic position and variation type of the structural variations of each sample, the data of all samples are merged, and each site is marked as either 'presence' or 'absence'.
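The type (1) rules of step (3)b can be sketched in Python as follows. This is a hedged illustration only: the `Seg` tuple and `classify_pair` are hypothetical names, read coordinates are assumed to be in reference-forward orientation, and the rules are tested in the order in which the claim lists them. As literally worded, the deletion and duplication conditions overlap, so this ordering is an assumption rather than something the claim specifies.

```python
from typing import NamedTuple, Optional


class Seg(NamedTuple):
    """One aligned piece of a read soft-clipped twice (hypothetical layout)."""
    chrom: str
    strand: str      # "+" or "-"
    read_start: int
    read_end: int
    ref_start: int
    ref_end: int


def classify_pair(a: Seg, b: Seg, min_size: int = 50) -> Optional[str]:
    """Classify the variation suggested by the two pieces of one read."""
    if a.chrom != b.chrom:
        return "translocation"   # pieces on different chromosomes
    if a.strand != b.strand:
        return "inversion"       # same chromosome, inconsistent direction
    first, second = sorted((a, b), key=lambda s: s.read_start)
    d_reads = second.read_start - first.read_end
    d_genome = second.ref_start - first.ref_end
    if d_reads - d_genome > min_size:
        return "insertion"
    if d_genome - d_reads > min_size:
        return "deletion"
    # Checked last, so it only fires when the deletion rule did not.
    if d_genome > min_size and d_reads < min_size:
        return "duplication"
    return None                  # no variation of at least min_size supported
```

For example, two pieces that are adjacent on the genome but 300 bp apart on the read are reported as an insertion, while pieces adjacent on the read but 500 bp apart on the genome are reported as a deletion.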
2. The method of claim 1, wherein the quality control and genome mapping of the nanopore sequencing data in step (1) comprise the following steps:
a. filtering out sequencing adapter sequences, reads with a quality value below 7, and reads shorter than 500 bp;
b. mapping the clean data obtained after quality control to a reference genome, and computing the alignment rate, the whole-genome read coverage and the sequencing depth;
c. evaluating the overall alignment quality of the nanopore sequencing data based on the information of step b: if the alignment rate is higher than 70% and the whole-genome coverage is higher than 50%, the third-generation data are judged to be free of abnormalities and downstream analysis is performed.
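The thresholds of claim 2 can be expressed as two small checks, sketched here in Python. The function names and argument layout are hypothetical, the quality value is assumed to be a per-read mean quality, and adapter trimming is outside the scope of the sketch.

```python
def passes_read_qc(length: int, mean_quality: float,
                   min_quality: float = 7.0, min_length: int = 500) -> bool:
    """Step a: keep a read only if quality >= 7 and length >= 500 bp."""
    return mean_quality >= min_quality and length >= min_length


def alignment_ok(mapped_reads: int, total_reads: int,
                 covered_bases: int, genome_size: int,
                 min_rate: float = 0.70, min_coverage: float = 0.50) -> bool:
    """Step c: proceed downstream only if alignment rate > 70% and coverage > 50%."""
    if total_reads == 0 or genome_size == 0:
        return False  # nothing to evaluate
    rate = mapped_reads / total_reads
    coverage = covered_bases / genome_size
    return rate > min_rate and coverage > min_coverage
```

A run with 900 of 1000 reads mapped and 75% of the genome covered would pass, whereas a run with only 600 of 1000 reads mapped would be flagged before any structural variation calling.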
3. The method of claim 1, wherein mining the within-alignment and split-read information in step (2) comprises the following steps:
a. performing a parameter initialization evaluation based on the long-read alignments of step (1): randomly extracting 1000 reads, evaluating their overall error rate and alignment scores, computing the distance between alignment differences using a 100 bp sliding window along the reads, and computing for each read the ratio of the primary alignment to the suboptimal alignment;
b. evaluating the reliability of each read from its alignment score and mapping quality, combined with the parameters of step a, and discarding the alignment information of reads for which: the mapping quality is less than 20; the ratio of the primary to the suboptimal alignment score is less than 2; the alignment score is below the minimum alignment score, which comes from the parameter initialization evaluation; or the read is soft-clipped more than 7 times;
c. recording the within-alignment and split-read information of the reads remaining after filtering, according to the alignment information.
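The discard rules of step b in claim 3 can be sketched as a single predicate in Python. The `ReadAlignment` record and `keep_read` are hypothetical names, and `min_score` is taken as an input because the claim derives it from the parameter initialization of step a without giving a formula. Treating a read with no suboptimal alignment as passing the ratio test is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class ReadAlignment:
    """Per-read alignment summary (hypothetical record layout)."""
    mapq: int               # mapping quality of the primary alignment
    primary_score: float    # alignment score of the primary alignment
    secondary_score: float  # best suboptimal score, 0 if there is none
    n_soft_clips: int       # number of times the read is soft-clipped


def keep_read(r: ReadAlignment, min_score: float,
              min_mapq: int = 20, min_ratio: float = 2.0,
              max_clips: int = 7) -> bool:
    """Return False if any discard rule of step b applies."""
    if r.mapq < min_mapq:
        return False
    # Assumption: the ratio test is skipped when no suboptimal alignment exists.
    if r.secondary_score > 0 and r.primary_score / r.secondary_score < min_ratio:
        return False
    if r.primary_score < min_score:
        return False
    if r.n_soft_clips > max_clips:
        return False
    return True
```

Reads that survive this filter are the ones whose within-alignment and split-read information is recorded in step c.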
4. Use of the method of any one of claims 1-3 in any one or more of the following:
(1) detecting variations between mutant and wild-type materials;
(2) detecting variations between extremely resistant materials;
(3) detecting a transgenic event and locating the position of the insert;
(4) integrating results across a large number of samples to assist in correcting variation sites;
(5) providing abundant structural variation data for GWAS analysis of natural resource populations;
(6) detecting structural variation and/or chimeric variation of large fragments;
(7) clustering the reads carrying the typed structural variations, and determining whether each structural variation is homozygous or heterozygous.
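The zygosity thresholds and the multi-sample merge described in step (4) of claim 1, and referred to in use (7) above, can be sketched in Python as follows. Several points are assumptions: allele frequency is taken as supporting reads divided by total reads at the site, sites below 0.3 are left unclassified because the claim does not name them, the two per-site marks are rendered as "presence" and "absence", and both function names are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

Site = Tuple[str, int, str]  # (chromosome, position, variation type)


def zygosity(support_reads: int, total_reads: int) -> Optional[str]:
    """Classify a structural variation by allele frequency."""
    if total_reads <= 0:
        return None
    af = support_reads / total_reads
    if af > 0.8:
        return "homozygous"
    if af >= 0.3:
        return "heterozygous"
    return None  # below 0.3: not classified by the claim


def presence_matrix(calls: Dict[str, Set[Site]]) -> Dict[Site, Dict[str, str]]:
    """Merge per-sample calls and mark every site in every sample."""
    sites = sorted(set().union(*calls.values()))
    samples = sorted(calls)
    return {
        site: {s: ("presence" if site in calls[s] else "absence")
               for s in samples}
        for site in sites
    }
```

With two samples where only one carries a given deletion, the resulting table marks that site as "presence" in the carrier and "absence" in the other, which is the per-site encoding the claim describes for the merged data.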
CN201910786443.0A 2019-08-23 2019-08-23 Method for detecting genome structure variation based on nanopore sequencing Active CN110600078B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910786443.0A CN110600078B (en) 2019-08-23 2019-08-23 Method for detecting genome structure variation based on nanopore sequencing

Publications (2)

Publication Number Publication Date
CN110600078A CN110600078A (en) 2019-12-20
CN110600078B CN110600078B (en) 2022-03-18

Family

ID=68855365

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910786443.0A Active CN110600078B (en) 2019-08-23 2019-08-23 Method for detecting genome structure variation based on nanopore sequencing

Country Status (1)

Country Link
CN (1) CN110600078B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111261225B (en) * 2020-02-06 2022-08-16 西安交通大学 Reverse correlation complex variation detection method based on second-generation sequencing data
CN111445950B (en) * 2020-03-19 2022-10-25 西安交通大学 High-fault-tolerance genome complex structure variation detection method based on filtering strategy
CN111292806B (en) * 2020-03-27 2022-04-26 武汉古奥基因科技有限公司 Transcriptome analysis method by using nanopore sequencing
CN111583996B (en) * 2020-04-20 2023-03-28 西安交通大学 Model-independent genome structure variation detection system and method
CN112397142B (en) * 2020-10-13 2023-02-03 山东大学 Gene variation detection method and system for multi-core processor
CN112289376B (en) * 2020-10-26 2021-07-06 北京吉因加医学检验实验室有限公司 Method and device for detecting somatic cell mutation
CN112646868A (en) * 2020-12-23 2021-04-13 赣南医学院 Method for detecting pathogenic molecules based on nanopore sequencing
CN112634988B (en) * 2021-01-07 2021-10-08 内江师范学院 Python language-based gene variation detection method and system
WO2023029044A1 (en) * 2021-09-06 2023-03-09 百图生科(北京)智能技术有限公司 Single-cell sequencing method and apparatus, and device, medium and program product
CN114464252B (en) * 2022-01-26 2023-06-27 深圳吉因加医学检验实验室 Method and device for detecting structural variation
CN115910199B (en) * 2022-11-01 2023-07-14 哈尔滨工业大学 Three-generation sequencing data structure variation detection method based on comparison framework
CN115762633B (en) * 2022-11-23 2024-01-23 哈尔滨工业大学 Genome structure variation genotype correction method based on three-generation sequencing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016000267A1 (en) * 2014-07-04 2016-01-07 深圳华大基因股份有限公司 Method for determining the sequence of a probe and method for detecting genomic structural variation
CN105849276A (en) * 2013-10-01 2016-08-10 生命技术公司 Systems and methods for detecting structural variants
CN106202991A (en) * 2016-06-30 2016-12-07 厦门艾德生物医药科技股份有限公司 The detection method of abrupt information in a kind of genome multiplex amplification order-checking product
CN110036117A (en) * 2016-12-16 2019-07-19 豪夫迈·罗氏有限公司 Increase the method for the treating capacity of single-molecule sequencing by multi-joint short dna segment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190080045A1 (en) * 2017-09-13 2019-03-14 The Jackson Laboratory Detection of high-resolution structural variants using long-read genome sequence analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Mapping and phasing of structural variation in patient genomes using nanopore sequencing;Mircea Cretu Stancu et al.;Nature Communications;20171106;pp. 1-13 *
Nanopore sequencing detects structural variants in cancer;Alexis L. Norris et al.;Cancer Biology & Therapy;20160224;Vol. 17 (No. 3);pp. 246-253 *

Also Published As

Publication number Publication date
CN110600078A (en) 2019-12-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant