CN111326212A - Detection method of structural variation - Google Patents

Info

Publication number
CN111326212A
Authority
CN
China
Prior art keywords
structural variation
read
sequence
structural
comparison
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010098320.0A
Other languages
Chinese (zh)
Other versions
CN111326212B (en)
Inventor
伍林军
白健
茹兰兰
郑璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Herui Gene Technology Co ltd
Original Assignee
Fujian Herui Gene Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Herui Gene Technology Co ltd filed Critical Fujian Herui Gene Technology Co ltd
Priority to CN202010098320.0A priority Critical patent/CN111326212B/en
Publication of CN111326212A publication Critical patent/CN111326212A/en
Application granted granted Critical
Publication of CN111326212B publication Critical patent/CN111326212B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for detecting structural variation, which simultaneously supports the identification of the structural variation of RNA and DNA, and has the advantages of high sensitivity, high specificity, high speed and low resource consumption. The invention also provides a complete system or device, a computer readable storage medium and equipment established based on the method.

Description

Detection method of structural variation
Technical Field
The invention belongs to the technical field of gene detection, and particularly relates to a structural variation detection method and a related system, device, computer readable storage medium and equipment thereof.
Background
Structural variations originating within the genome include deletions, inversions and duplications within the same chromosome, as well as abnormal connections between different chromosomes. Whatever the event, the result is often that different portions of two genes become physically linked, so that transcription yields a new transcript composed of portions of the transcripts of the two genes. Genes affected by structural variation are scientifically important in the occurrence and development of cancer, and are of great medical value for studying the mechanisms of tumor occurrence and development and for treating and monitoring tumors. For example, BCR->ABL is widely found in hematological tumors, bladder cancer, lung cancer, malignant glioma and other tumors; FGFR3->TACC3 occurs mainly in bladder cancer, cervical squamous carcinoma and cervical adenocarcinoma; and EML4->ALK occurs mainly in lung cancer.
Structural variation detection based on second-generation sequencing has been available for a long time. Whether a structural variation has occurred is judged mainly by sequencing a target region or the whole genome and analyzing the resulting sequences.
Detection at the DNA level mainly aligns the sequencing data to the genome and collects evidence that may support a structural variation according to whether the aligned reads show fragmentation alignment, that is, whether two parts of a read are aligned to different positions of the genome. If fragmentation alignment occurs, the two parts of the read are analyzed further, and the cause of the structural variation is inferred from their aligned positions and strand directions in order to determine the resulting structural variation. With paired-end sequencing, two reads are generated from the same template, so abnormal read pairs supporting structural variation can also be collected according to whether the alignment of the two paired reads is abnormal. Under normal conditions, one of the two reads aligns to the positive strand of the genome and the other to the negative strand, and, viewed along the transcription direction of the nucleic acid, the insert lengths are consistent and fall within a reasonable distribution range. If the two reads come from the two parts of a gene affected by structural variation, the direction is abnormal or the implied insert length is abnormal. However, the methods released so far suffer from long computation time, low sensitivity, high false positive rates, the lack of an annotation module, and other defects.
Detection at the RNA level usually requires two reference sequences, the genome and the transcriptome: reads are aligned to the transcriptome, the coordinates are then mapped to the genome, and the mechanism of the structural variation event is inferred from the alignment characteristics of the reads, following the approach used for DNA, to determine the type of structural variation and the genes involved.
Disclosure of Invention
The invention aims to provide a structural variation detection method which simultaneously supports the structural variation recognition of RNA and DNA, and has the advantages of high sensitivity, high specificity, high speed and low resource consumption.
In order to achieve the above object, the present invention provides a method for detecting structural variation, the method comprising the steps of:
1) aligning the sequencing data to a reference genomic sequence or a reference master transcript sequence;
2) searching for normal alignment reads, fragmentation alignment reads and discordant alignment read pairs;
3) classifying the fragmentation alignment reads and the discordant alignment read pairs;
4) grouping the different types of fragmentation alignment reads and discordant alignment read pairs respectively, so that reads supporting the same structural variation event fall into the same set;
5) for structural variation events determined by the fragmentation alignment reads, conserved sequences are formed by assembling the reads that support the structural variation events;
6) determining an exact breakpoint location based on the conserved sequence;
7) merging, within each evidence type, the structural variations supported by fragmentation alignment reads and the structural variations supported by discordant alignment read pairs, wherein merging refers to combining structural variation events with similar breakpoints and the same type into a single structural variation event;
8) merging structural variations supported by fragmentation alignment reads with structural variations supported by discordant alignment read pairs when they have similar breakpoints and the same type;
9) deleting structural variation events whose conserved sequence can be aligned completely and continuously to a segment of the genome, or whose conserved sequence yields multiple consistent fragmentation alignments;
10) calculating the frequency of structural variation events.
In a specific embodiment, in step 1), if the sequencing data is DNA data, it is aligned to a reference genomic sequence; if the sequencing data is RNA data, it is aligned to the reference master transcript sequence.
In a specific embodiment, step 2) further comprises collecting the insert lengths of the normal alignment reads and calculating the main parameters of the insert length distribution; the main parameters are preferably the maximum, the minimum and/or the mean.
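The insert length statistics described here can be illustrated with a minimal Python sketch. It assumes the insert lengths have already been extracted as signed template lengths (for example the TLEN field of a BAM record); the function name `insert_size_summary` and the sample values are hypothetical and are not taken from the patent.

```python
from statistics import mean


def insert_size_summary(template_lengths):
    """Return (minimum, maximum, mean) of the absolute template lengths.

    Zero values are skipped because a template length of 0 means the
    length could not be determined for that read.
    """
    sizes = [abs(t) for t in template_lengths if t != 0]
    if not sizes:
        raise ValueError("no usable template lengths supplied")
    return min(sizes), max(sizes), mean(sizes)


# Hypothetical signed template lengths from normally aligned read pairs.
summary = insert_size_summary([180, -180, 210, -210, 150, -150, 0])
```

The maximum returned here is the kind of value that later steps could use as the "maximum insert" tolerance.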
In a specific embodiment, in step 3), the classification of the reads for which fragmentation alignment occurs can be based on the following criteria:
whether the two parts are aligned to the same chromosome, whether they are aligned in different directions on the genome, and/or whether the splicing position is upstream of the alignment position.
In one embodiment, in step 3), the reads with fragmentation alignment can be sorted according to the following criteria:
[Table image BDA0002386038650000031 (classification of fragmentation alignment reads) is not reproduced in this text.]
In one embodiment, in step 3), the classification of discordant alignment read pairs may be based on the following criteria:
whether the two reads are aligned to the same chromosome, whether they are aligned in different directions on the genome, and/or the insert size.
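As an illustration only, the three criteria can be computed for a read pair as a small feature tuple. The sketch below is an assumption about how such features might be derived; the category labels of the patent's classification table are not available in this text, so none are assigned. The names `PairFeatures`, `pair_features` and `is_discordant` are hypothetical.

```python
from typing import NamedTuple


class PairFeatures(NamedTuple):
    same_chromosome: bool
    opposite_strands: bool
    insert_in_range: bool


def pair_features(read1, read2, min_insert, max_insert):
    """Describe a read pair; each read is (chromosome, start, end, is_reverse)."""
    chrom1, start1, end1, rev1 = read1
    chrom2, start2, end2, rev2 = read2
    same_chromosome = chrom1 == chrom2
    opposite_strands = rev1 != rev2
    if same_chromosome:
        # Implied insert: span from the smallest start to the largest end.
        implied_insert = max(end1, end2) - min(start1, start2)
        insert_in_range = min_insert <= implied_insert <= max_insert
    else:
        # An insert length is not defined across chromosomes.
        insert_in_range = False
    return PairFeatures(same_chromosome, opposite_strands, insert_in_range)


def is_discordant(features):
    """A pair is treated as discordant if any of the three criteria fails."""
    return not all(features)
```

A pair that passes all three checks would be a candidate normal pair; any failing check marks it as evidence to classify further.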
In one embodiment, in step 3), discordant alignment read pairs can be categorized according to the following criteria:
[Table images BDA0002386038650000032 and BDA0002386038650000041 (classification of discordant alignment read pairs) are not reproduced in this text.]
In one embodiment, SA tags may be used to find fragmentation alignment reads and/or discordant alignment read pairs.
In one embodiment, when identifying fragmentation alignment reads, a read of which part is also aligned elsewhere is not included in the calculation.
In one embodiment, when identifying fragmentation alignment reads, the fragmentation alignment records from the same read can be treated as one entity.
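For readers unfamiliar with the SA tag, the following Python sketch parses its value as laid out in the SAM optional-fields specification: semicolon-separated entries of `rname,pos,strand,CIGAR,mapQ,NM`. The rule in `has_single_split` (exactly one supplementary entry) is only one possible reading of the exclusion described above and is an assumption, as are the names `Supplementary`, `parse_sa_tag` and `has_single_split`.

```python
from typing import List, NamedTuple


class Supplementary(NamedTuple):
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost aligned position
    strand: str  # "+" or "-"
    cigar: str
    mapq: int
    nm: int      # edit distance


def parse_sa_tag(value: str) -> List[Supplementary]:
    """Parse the value of an SA:Z tag into its supplementary alignments."""
    records = []
    for entry in value.split(";"):
        if not entry:
            continue  # the tag value ends with a trailing semicolon
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        records.append(
            Supplementary(rname, int(pos), strand, cigar, int(mapq), int(nm))
        )
    return records


def has_single_split(value: str) -> bool:
    """Assumed rule: keep a read only if it has exactly one other alignment."""
    return len(parse_sa_tag(value)) == 1
```

Because the SA tag carries the other alignment of the same read, the primary record and its tag together can be handled as one entity in a single pass.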
In a particular embodiment, in step 4), the grouping may be performed by cluster analysis.
Preferably, the criteria for clustering the fragmentation alignment reads include the type of structural variation, the name of the reference sequence for which the two parts of the fragmentation alignment read are aligned, the location of the first fragmentation point, and/or the location of the second fragmentation point.
Further, reads that have the same structural variation type, the same aligned reference sequence names, and first and second breakpoints positioned within m bases of each other are treated as one class, where m is a natural number within 30, preferably 10.
Further, if the number of supporting reads in a class reaches a preset threshold (a natural number of 1 or more), the average breakpoint position is obtained by averaging the breakpoint positions of all reads in the class.
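The clustering and averaging described above can be sketched as follows. This is a simplified illustration rather than the patented implementation: each read is assumed to be summarized as `(sv_type, ref1, ref2, bp1, bp2)`, a read joins a class when both breakpoints lie within `m` bases of the first read of that class (an assumption about how "within m bases" is measured), and the threshold is treated as inclusive. The function name and the example data are hypothetical.

```python
from collections import defaultdict


def cluster_split_reads(reads, m=10, min_support=1):
    """Group reads and return (sv_type, ref1, ref2, mean_bp1, mean_bp2, support)."""
    buckets = defaultdict(list)
    # Sorting mirrors the sort criteria: type, reference names, breakpoints.
    for sv_type, ref1, ref2, bp1, bp2 in sorted(reads):
        clusters = buckets[(sv_type, ref1, ref2)]
        for cluster in clusters:
            seed1, seed2 = cluster[0]
            if abs(bp1 - seed1) <= m and abs(bp2 - seed2) <= m:
                cluster.append((bp1, bp2))
                break
        else:
            clusters.append([(bp1, bp2)])  # start a new class

    events = []
    for key, clusters in buckets.items():
        for cluster in clusters:
            support = len(cluster)
            if support >= min_support:
                mean1 = round(sum(b[0] for b in cluster) / support)
                mean2 = round(sum(b[1] for b in cluster) / support)
                events.append((*key, mean1, mean2, support))
    return events


example = [
    ("DEL", "chr1", "chr1", 1000, 5000),
    ("DEL", "chr1", "chr1", 1004, 5002),
    ("DEL", "chr1", "chr1", 1002, 4998),
    ("DEL", "chr1", "chr1", 2000, 9000),
]
events = cluster_split_reads(example)
```

With `min_support=1` even a single read forms an event, which matches the sensitivity goal stated in the text; raising the threshold trades sensitivity for specificity.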
Preferably, the criteria for clustering discordant aligned read pairs include the type of structural variation, the reference sequence name of the read alignment, the reference sequence name of the paired read alignment, the read alignment location, and/or the location of the paired read alignment.
Further, read pairs that have the same aligned reference sequence names and whose alignment positions differ by no more than the maximum insert size are treated as one class.
Further, if the number of supporting read pairs in a class reaches a preset threshold (a natural number of 1 or more, preferably 2), the breakpoint position is estimated from the breakpoint range determined by the read pairs in the class; preferably, the breakpoint position is estimated progressively, narrowing the interval that contains the breakpoint step by step according to the alignment start and end positions of the read pairs.
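One way to picture the progressive narrowing is as an interval intersection. The sketch below is an assumed reading, not the patent's algorithm: it handles only forward-strand reads lying upstream of a breakpoint, each of which implies that the breakpoint falls between the read's alignment end and its alignment start plus the maximum insert size. The name `narrow_breakpoint` and the coordinates are hypothetical.

```python
def narrow_breakpoint(read_spans, max_insert):
    """Intersect the breakpoint intervals implied by each read.

    read_spans: iterable of (start, end) alignment coordinates of
    forward-strand reads located upstream of the breakpoint.
    Returns (low, high), or None if the reads are mutually inconsistent.
    """
    low, high = None, None
    for start, end in read_spans:
        read_low, read_high = end, start + max_insert
        low = read_low if low is None else max(low, read_low)
        high = read_high if high is None else min(high, read_high)
    if low is None or low > high:
        return None
    return low, high


interval = narrow_breakpoint([(100, 200), (150, 250), (120, 220)], max_insert=400)
```

Each additional read pair can only shrink the interval, so more support gives a tighter breakpoint estimate.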
In a particular embodiment, in step 5), the assembly may be performed by multiple sequence alignments.
In a specific embodiment, step 5) further comprises a step of performing analytical reconstruction on the sequences involved in the assembly, wherein the analytical reconstruction comprises: extracting short insertion sequences near the break point and/or adjusting the sequence direction to make the read 5' end consistent with the reference sequence direction.
In a specific embodiment, in step 6), the conserved sequence is compared with the reference breakpoint sequence, and the precise breakpoint position is determined according to the position where the fracture alignment occurs; the reference breakpoint sequence comprises two portions, one from the reference sequence spanning one breakpoint of a structural variant event and the other from the reference sequence spanning the other breakpoint of the structural variant event.
In one embodiment, in step 7), more highly supported structural variant events are retained upon merging.
In a specific embodiment, in step 7), structural variation events supported by fragmentation alignment reads are merged when they are of the same type and their breakpoint distance is within the length of the homologous sequence of either event or is less than n bases, wherein n is a natural number within 30, preferably 10.
In one embodiment, in step 7), structural variation events supported by discordant alignment read pairs are merged when they are of the same type and their breakpoint distance is within the maximum insert length.
In one embodiment, the conditions for merging in step 8) include that the breakpoint of a structural variation event supported by fragmentation alignment reads lies within the maximum insert length of the breakpoint of a structural variation event of the same type supported by discordant alignment read pairs.
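The merging conditions of steps 7) and 8) can be summarized in a short sketch. It is illustrative only: the `Event` record, its field names, the default `max_insert` of 500 and the requirement that both breakpoints satisfy the tolerance are assumptions; only the tolerances themselves (homology length or n bases for fragmentation evidence, maximum insert length when discordant pairs are involved) and the rule of keeping the better-supported event come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    sv_type: str
    refs: tuple      # names of the two reference sequences
    bp1: int
    bp2: int
    support: int
    evidence: str    # "split" or "discordant"
    homology: int = 0  # homologous sequence length at the breakpoint


def can_merge(a, b, n=10, max_insert=500):
    """Check whether two events meet the merging conditions."""
    if a.sv_type != b.sv_type or a.refs != b.refs:
        return False
    if a.evidence == "split" and b.evidence == "split":
        homology = max(a.homology, b.homology)

        def close(distance):
            return distance < n or distance <= homology
    else:
        # Any comparison involving discordant pairs uses the insert length.
        def close(distance):
            return distance <= max_insert
    return close(abs(a.bp1 - b.bp1)) and close(abs(a.bp2 - b.bp2))


def merge(a, b):
    """Keep the event with the higher support."""
    return a if a.support >= b.support else b
```

The wider tolerance for discordant pairs reflects that they only bound a breakpoint to within an insert length, whereas fragmentation alignments locate it to within a few bases.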
In a specific embodiment, the alignment in step 9) may be performed using a BWT algorithm.
In a particular embodiment, in step 9), if the conserved sequence can be aligned completely to a continuous region of the reference genomic sequence or reference master transcript sequence, the group of records is recorded and/or filtered out. If the conserved sequence shows fragmentation alignment on the reference genomic sequence or reference master transcript sequence, the breakpoint position determined by the fragmentation alignment lies within 10 bp of the previously calculated breakpoint position, and there are 2 or more groups of such fragmentation alignments, the group of records is recorded and/or filtered out; if there is exactly 1 group of such fragmentation alignments, the breakpoint is refined further according to that breakpoint position; if there is less than 1 group, no labeling and/or modification is performed.
In a specific embodiment, in step 10), the structural variant event frequency is calculated by reference type counts and variant counts of the two breakpoints of the structural variant event, the reference type being of a type identical to the reference genomic sequence or the reference master transcript sequence, and the variant being of a type identical to the structural variant event sequence; preferably, the reference pattern with higher support number on both sides of the breakpoint is used as the reference pattern count of the structural variation event.
Further, structural variation event frequency = (number of molecules supporting the structural variation) / (number of molecules supporting the structural variation + number of molecules supporting the reference).
Preferably, if both reads of the same template support a certain structural variation event, it is counted as one molecule.
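The frequency calculation can be sketched in a few lines of Python. The sketch assumes each supporting read is identified by the name of its template, so that both reads of a pair collapse to one molecule, and it takes the larger of the two breakpoint-side reference counts as the reference count, as described above. The function name `sv_frequency` and the template names are hypothetical.

```python
def sv_frequency(variant_templates, reference_bp1, reference_bp2):
    """Frequency = variant molecules / (variant molecules + reference molecules).

    Each argument is an iterable of template names, one per supporting read;
    duplicates from the two reads of one template are counted once.
    """
    variant = len(set(variant_templates))
    # Use the breakpoint side with the higher reference support.
    reference = max(len(set(reference_bp1)), len(set(reference_bp2)))
    total = variant + reference
    return variant / total if total else 0.0


frequency = sv_frequency(
    ["t1", "t1", "t2"],                    # both reads of t1 support the event
    ["r1", "r2", "r3", "r3"],              # reference support at breakpoint 1
    ["r4", "r5", "r6", "r7", "r8", "r9"],  # reference support at breakpoint 2
)
```

In this example 2 variant molecules and 6 reference molecules are counted.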
The detection method of the present invention may further comprise the step of annotating the fusion gene and/or outputting the detection result; preferably, the annotated information entries include, but are not limited to, the name of the fusion gene, the breakpoint location of the fusion gene, exon and intron locations, whether the linkage is correct, structural variation sequences, and/or the support numbers of the various types of structural variation; preferably, the result is output in a tab-delimited format or a binary compressed format.
Structural variation detection software in the prior art usually has no fusion gene annotation module, so annotation is usually performed with third-party annotation software, which cannot make full use of the data produced by the structural variation calculation software; as a result, annotation is often insufficient or even fails. To solve this problem, software developed on the basis of the method can have a built-in annotation module, so that structural variation events are annotated and reported directly once the calculation is finished.
To allow personalized output, various criteria can be set for the scope of the structural variation report, such as limits on the support number, the reported range and the structural variation direction. The report can take two formats: a tab-delimited format, which is convenient for manual reading, and a binary compressed format, which is convenient for computer processing. A companion tool module for further mining of the results can also be provided, which performs operations such as filtering, regenerating a report and merging a supporting background pool on the basis of existing results.
The output information can be selected as desired, including but not limited to: structural variation-related fusion genes, the number of molecules supporting the structural variation, the number of molecules supporting the reference type, structural variation frequency, whether the structural variation occurs in COSMIC or OncoKB, fusion gene status, distance between the two genes involved in the fusion, chromosomal information, breakpoint location, structural variation type, structural variation size, insertion sequence near the breakpoint, fusion mask and/or conserved sequence near the breakpoint, and the like.
The detection method of the invention may further comprise the step of sequencing and/or obtaining sequencing data.
The technical scheme of the invention can be used in various diagnosis and non-diagnosis application scenes of cancer. The technical scheme of the invention can be suitable for tumors of any stage, such as very early tumors, intermediate tumors and late tumors; preferably for use in early stage tumors or very early stage tumors.
It is another object of the present invention to provide a system or apparatus for detecting structural variations, the system or apparatus comprising:
1) a sequencing module and/or a sequencing data acquisition module; and
2) an identification module for executing the detection method of structural variation of the invention on the sequencing data.
In a specific embodiment, the system or apparatus further comprises:
3) an annotation module for annotating the fusion gene; and/or
4) an output module for outputting the detection result.
The present invention also provides a computer-readable storage medium including a stored computer program containing a program for executing the method for detecting a structural variation of the present invention.
The invention also provides an apparatus comprising a processor, a memory and a computer program stored in the memory, the computer program comprising a program for performing the method of detection of structural variations of the invention.
The beneficial effects of the invention at least comprise the following aspects:
(1) The detection method of the invention supports structural variation identification for both RNA and DNA, and offers extremely high sensitivity, high specificity, high speed and low resource consumption. A variation can be detected with only one supporting read or read pair, whereas the prior art typically requires at least 3 reads or 3 read pairs. This sensitivity is a qualitative improvement because, in practical sequencing applications, and especially when sequencing highly heterogeneous tumor samples with low-frequency variants, it is often difficult to obtain enough reads (at least 3) that cross the breakpoint and are long enough to meet the requirements of prior art methods.
(2) Aiming at the structural variation calculation based on RNA sequencing, the invention adopts a novel comparison calculation method of the main transcript, thoroughly avoids the influence of an intron on comparison and calculation, and obtains excellent effect.
(3) With the method of the invention, fragmentation alignment reads and discordant alignment read pairs can each be classified by accessing the read only once, without relying on the read's other alignment records. The requirements on the alignment records are also stricter: if the fragmentation-aligned part also aligns elsewhere, the read is regarded as coming from a repetitive region and is not counted.
(4) When classifying reads, prior art methods often sort them by breakpoint position, which easily causes false negatives and false positives; the method of the invention effectively avoids this problem.
(5) With the method of the invention, grouping the different types of fragmentation alignment reads and discordant alignment read pairs avoids a complex computational model and simplifies the calculation. Prior art methods construct and cluster complex nodes and trees when determining breakpoints, which makes the calculation complex and prone to false positives.
(6) Assembling the sequences with the method of the invention effectively improves the accuracy of breakpoint determination.
(7) The invention uses the SA tag to find fragmentation alignment reads and/or discordant alignment read pairs, so classification can be completed by accessing each read only once, whereas the prior art needs to obtain two alignment records of a read at the same time for the calculation.
(8) The method of the invention places stricter requirements on alignment records: when fragmentation alignment reads are identified, a read of which part also aligns elsewhere is regarded as coming from a repetitive region and is not counted.
(9) When fragmentation alignment reads are identified, the fragmentation alignment records from the same read are treated as one entity. The prior art usually sorts records by breakpoint position without distinguishing whether they come from the same read. The method of the invention helps to avoid false negatives and false positives.
Drawings
FIG. 1 is a flow chart of the method for detecting structural variation according to the present invention.
Detailed Description
Unless otherwise defined, all terms used herein have the meanings commonly used in the art, and all reagents used therein are commercially available in the art.
The term "reference sequence" in the present invention refers to a sequence in a DNA sequence file or an RNA sequence file of a transcriptome derived from a reference genome.
The term "normal alignment reads" in the present invention means that a read can be aligned to a continuous region on the genome continuously during the alignment of the read with the genome, and if the read has a length of m, the read can be aligned to a continuous closed region [ a, b ] on the genome, allowing for mismatches and short indels, but no splicing at the beginning and at the end.
The term "read with fragmented alignment" means that, when a read is aligned to the genome, two parts of the read are aligned to discontinuous regions of the genome. For example, for a read of length m and 10 < n < m, read [0, n] is normally aligned to genome region [a, b] and read [n+1, m] is normally aligned to genome region [c, d], where [a, b] and [c, d] are either on different chromosomes or separated by a distance greater than 100.
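The definition can be expressed as a small predicate. The sketch below assumes that each aligned part is given as `(chromosome, start, end)` and that "distance" means the gap between the two aligned regions, which the text does not specify; the function name `is_fragmented_alignment` is hypothetical.

```python
def is_fragmented_alignment(seg1, seg2, min_distance=100):
    """Return True if two aligned parts of one read are discontinuous.

    seg1 and seg2 are (chromosome, start, end) tuples. The parts count as
    discontinuous when they lie on different chromosomes or when the gap
    between them exceeds min_distance.
    """
    chrom1, start1, end1 = seg1
    chrom2, start2, end2 = seg2
    if chrom1 != chrom2:
        return True
    gap = max(start1, start2) - min(end1, end2)
    return gap > min_distance
```

For example, parts aligned at chr1:100-160 and chr1:400-490 are 240 apart and qualify, while parts at chr1:100-160 and chr1:170-260 do not.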
The term "discordant alignment read pair" refers to a pair of reads obtained by paired-end sequencing of a template whose individual alignments do not support an origin from the same normal template, where a normal template is a continuous region [a, b] on the genome.
With respect to the term "insertion sequence" in the present invention, the insertion sequence has different meanings under different contexts, and when calculating the template length distribution, the insertion sequence of a read refers to the length of the template, i.e. the absolute value of the difference between the minimum start position and the maximum end position of a pair of reads; in calculating structural variation, it is meant that there is an extra sequence at a site in the sample relative to the reference genome.
The terms "maximum insert" and "maximum insert sequence" in the present invention refer to the insert sequence (template) having the longest length.
In the invention, the term "support" denotes a causal relationship: "A supports B" means that phenomenon A indicates that B has occurred.
The term "structural variation event" as used herein refers to a major change in the genome of a sample relative to a normal reference genomic sequence, and specifically includes inversion, insertion, deletion, duplication, and translocation.
The term "conserved sequence" in the invention is defined for a set of sequences as follows: the sequences are subjected to multiple sequence alignment; at each position of the alignment result, the base with the highest support number, provided that it is supported by more than three sequences, is taken as the conserved base; and all conserved bases are joined according to their relative positions in the original sequences to obtain the conserved sequence.
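A minimal sketch of this consensus rule, assuming the multiple sequence alignment has already been computed and is supplied as equal-length strings padded with `-` for gaps. Positions with no sufficiently supported base are simply skipped here, which is an assumption; the function name `conserved_sequence` and the example reads are hypothetical.

```python
from collections import Counter


def conserved_sequence(aligned, min_support=3):
    """Build a consensus from equal-length aligned sequences.

    At each column the most frequent non-gap base is kept if it is
    supported by more than min_support sequences.
    """
    bases = []
    for column in zip(*aligned):
        counts = Counter(base for base in column if base != "-")
        if not counts:
            continue  # column contains only gaps
        base, support = counts.most_common(1)[0]
        if support > min_support:
            bases.append(base)
    return "".join(bases)


consensus = conserved_sequence(["ACGT-", "ACGTA", "ACCTA", "ACGTA", "-CGTA"])
```

In the example, the single mismatching C in the third read is outvoted by four reads carrying G at that position.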
The term "major transcript" as used herein refers to the longest, more thoroughly studied, and most clinically valuable transcript of several manually selected transcripts of the same gene.
The term "break point location" in the present invention refers to the start and stop location of a structural variation event, which is a location on the genome or transcriptome.
The term "reference type" in the present invention refers to a genotype identical to a reference genome or reference transcript.
The term "variant" as used herein refers to a genotype identical to that of the structural variation event.
The term "molecular count" as used herein means counting according to the number of templates.
The term "template" in the present invention refers to a DNA molecule used in the sequencing of an original sample.
The terms "smaller aligned position" and "larger aligned position" in the present invention are a pair of relative concepts, and are a measure of the aligned position of a read on a reference genomic sequence or a reference master transcript sequence, with the position closer to 5 'being smaller and the position closer to 3' being larger. The terms "smaller coordinates" and "larger coordinates" in the present invention mean that the coordinates on the same chromosome or transcript are smaller at positions close to 5 'and larger at positions close to 3'.
The terms "upstream splicing" and "downstream splicing" in the present invention mean that, when a sequence is aligned to a reference genomic sequence or a reference master transcript sequence, a portion of the sequence is completely aligned and the other portion is spliced, and if the splicing occurs upstream of the completely aligned portion, the upstream splicing is performed, and if the splicing occurs downstream of the completely aligned portion, the downstream splicing is performed.
The terms "large chromosome" and "small chromosome" in the present invention mean that chromosomes are mapped into natural numbers in the BAM file, and large chromosomes are mapped to larger natural numbers, and small chromosomes are mapped to smaller natural numbers.
The technical contents of the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments. It will be appreciated by those skilled in the art that the following examples are illustrative of the invention only and should not be taken as limiting the scope of the invention.
Example 1
15,000 samples from patients with solid tumors and hematological tumors were collected; DNA was extracted for paired-end, capture-based second-generation library construction and sequencing, and RNA was extracted from some of the samples for capture sequencing. Each sample library was sequenced. DNA libraries were aligned to the human reference genome hg19 using the BWA alignment software; RNA libraries were aligned with BWA to a separately constructed transcriptome. The transcriptome was constructed as follows:
One transcript is selected as the main transcript for each gene. A transcript recorded in the COSMIC and OncoKB databases is preferred; failing that, the transcript recorded as the main transcript in the UCSC database is used; and if none of the three databases identifies a main transcript, the longest transcript is used as the main transcript.
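The selection priority can be sketched as a simple fallback chain. The database contents are assumed to have been loaded beforehand into plain sets of transcript identifiers; the function name, parameter names and the placeholder identifiers `TX1` to `TX3` are hypothetical and do not refer to real transcripts.

```python
def pick_main_transcript(transcripts, cosmic_oncokb_ids, ucsc_main_ids):
    """Choose the main transcript of one gene.

    transcripts: list of (transcript_id, length) for the gene.
    cosmic_oncokb_ids: IDs recorded in the COSMIC/OncoKB databases.
    ucsc_main_ids: IDs recorded as main transcripts in UCSC.
    """
    for transcript_id, _ in transcripts:
        if transcript_id in cosmic_oncokb_ids:
            return transcript_id
    for transcript_id, _ in transcripts:
        if transcript_id in ucsc_main_ids:
            return transcript_id
    # No database entry: fall back to the longest transcript.
    return max(transcripts, key=lambda item: item[1])[0]


gene_transcripts = [("TX1", 2100), ("TX2", 3400), ("TX3", 1800)]
chosen = pick_main_transcript(gene_transcripts, {"TX3"}, {"TX1"})
```

Here `TX3` is chosen because the first-priority database lists it, even though `TX2` is longer.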
PCR duplicates are removed from the alignment results and errors are corrected to generate a BAM file for structural variation detection. The BAM file is used as the input file of the method, with the minimum number of discordant alignment read pair seeds set to 2, the minimum number of fragmentation alignment read seeds set to 1, the working memory set to 8 GB and the number of threads set to 8, and the job is submitted to a computing cluster for analysis.
Example 2
The alignment result file is traversed and 1,000,000 normally aligned reads are taken (the selection criteria being that no fragmentation alignment occurs and that the read pair is consistent with a normal read pair and supports no structural variation); their insert lengths are collected, and the main parameters of the insert length distribution (maximum, minimum and mean) are calculated.
The alignment result file is traversed again and all reads are visited to find every fragmentation alignment read and every discordant alignment read pair, which are then classified according to the following features. Fragmentation alignment reads are classified mainly by how their two parts are aligned; the criteria are whether both parts align to the same chromosome, whether they align in different directions on the genome, and whether the splicing positions are both upstream of the alignment record (see Table 1 for the detailed classification). Discordant alignment read pairs are classified mainly by whether both reads align to the same chromosome, whether they align in different directions on the genome, and the insert length (see Table 2 for the detailed classification).
TABLE 1 fragmentation comparison readouts
Figure BDA0002386038650000101
Figure BDA0002386038650000111
TABLE 2 Classification of discordantly aligned read pairs
Figure BDA0002386038650000112
The different types of split-aligned reads and discordantly aligned read pairs are then clustered separately. The purpose of clustering is to place the reads that support the same structural variation event into one set, so that a more accurate breakpoint is obtained and the random breakpoint error caused by alignment and sequencing errors is reduced as far as possible.
Split-aligned reads are clustered as follows. The split-aligned reads are sorted by structural variation type, the names of the reference sequences to which the two parts of the read align, the position of the first breakpoint and the position of the second breakpoint. Reads with the same structural variation type, the same reference sequence names and both breakpoint positions within 10 bases of each other are treated as one class. If the number of supporting reads in a class reaches a preset threshold (set to 1 in this embodiment; because probe design and fragmentation for sequencing mean that many reads crossing a breakpoint may not be captured, the minimum threshold of 1 may be chosen, which gives extremely high sensitivity), the breakpoint positions of all reads in the class are averaged to obtain the mean breakpoint position, a structural variation event supported by split-aligned reads is generated, and the maximum and minimum deviations from the mean breakpoint are used as the breakpoint error range.
Discordantly aligned read pairs are clustered as follows. The read pairs are sorted by structural variation type, the name of the reference sequence to which the read aligns, the name of the reference sequence to which its mate aligns, the alignment position of the read and the alignment position of the mate. Read pairs with the same reference sequence names and alignment positions that differ by no more than the maximum insert size are treated as one class. If the number of supporting read pairs in a class reaches a preset threshold (set to 2 in this embodiment; it can be adjusted according to the actual sequencing quality, with a minimum of 1 giving the greatest sensitivity; 2 is recommended, because a higher number of discordantly aligned read pairs often cannot be reached at the initial evidence collection stage, whereas too low a setting leads to a large amount of computation), the breakpoint position is estimated from the breakpoint range determined by the read pairs in the class, and a structural variation event supported by discordantly aligned read pairs is generated.
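Both clustering procedures follow the same sort-and-sweep pattern and differ only in the distance tolerance and in how a cluster is summarised. The sketch below illustrates that pattern under stated assumptions: the `Evidence` record, the function names, and the choice to measure distances against the first record of the current cluster are all illustrative, since the text does not specify these details.

```python
from collections import namedtuple

# One piece of evidence: a split-aligned read or a discordantly aligned pair.
Evidence = namedtuple("Evidence", "sv_type ref1 ref2 pos1 pos2")

def _compatible(anchor, rec, tolerance):
    same_group = anchor[:3] == rec[:3]  # SV type and both reference names
    return (same_group
            and abs(rec.pos1 - anchor.pos1) <= tolerance
            and abs(rec.pos2 - anchor.pos2) <= tolerance)

def sweep_clusters(records, tolerance, min_support):
    """Sort the evidence, then sweep once, cutting a new cluster whenever a
    record is incompatible with the first record of the current cluster."""
    clusters, current = [], []
    for rec in sorted(records):  # tuple order matches the sort keys in the text
        if current and not _compatible(current[0], rec, tolerance):
            clusters.append(current)
            current = []
        current.append(rec)
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_support]

def split_read_event(cluster):
    """Mean breakpoints plus (min, max) deviation as the error range."""
    event = {"sv_type": cluster[0].sv_type, "support": len(cluster)}
    for key in ("pos1", "pos2"):
        values = [getattr(rec, key) for rec in cluster]
        centre = round(sum(values) / len(values))
        event[key] = centre
        event[key + "_error"] = (min(values) - centre, max(values) - centre)
    return event

def discordant_event(cluster):
    """Breakpoint range spanned by the read pairs of the cluster."""
    event = {"sv_type": cluster[0].sv_type, "support": len(cluster)}
    for key in ("pos1", "pos2"):
        values = [getattr(rec, key) for rec in cluster]
        event[key + "_range"] = (min(values), max(values))
    return event
```

With the settings of this embodiment, split-aligned reads would be clustered with `tolerance=10, min_support=1`, and discordantly aligned read pairs with the maximum insert size as the tolerance and `min_support=2`.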
For a structural variation event determined by split-aligned reads, a conserved (consensus) sequence is obtained by assembling the reads that support the event. The assembly is performed by multiple sequence alignment, so that the conserved sequence reduces the influence on breakpoint determination of small insertions/deletions or single-nucleotide changes caused by sequencing errors. During assembly, the sequences taking part are analyzed and reconstructed, mainly by: extracting short inserted sequences near the breakpoint, since such insertions affect alignment to the reference sequence and thus the accurate determination of the breakpoint; and adjusting the sequence orientation so that the 5' end of each read is, as far as possible, consistent with the direction of the reference sequence, which facilitates uniform downstream processing and breakpoint calculation.
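To illustrate why a conserved sequence suppresses isolated sequencing errors, the sketch below swaps in a much simpler technique than the multiple sequence alignment used in the text: a column-wise majority vote over reads that are assumed to be already oriented and placed at known offsets, without insertions or deletions. The function name and input layout are hypothetical.

```python
from collections import Counter

def majority_consensus(placed_reads):
    """Column-wise majority vote.

    placed_reads -- list of (offset, sequence) tuples; offset is the position
                    of the first base of the read in the cluster coordinate
                    system (assumed to be known already).
    Columns without any coverage are skipped.
    """
    columns = {}
    for offset, seq in placed_reads:
        for i, base in enumerate(seq):
            columns.setdefault(offset + i, Counter())[base] += 1
    return "".join(columns[pos].most_common(1)[0][0] for pos in sorted(columns))
```

In the usage case of three overlapping reads where one read carries a single mismatched base, the mismatch is outvoted by the other two reads and does not appear in the result.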
After the conserved sequence is obtained, it is aligned to a reference sequence, and the accurate breakpoint position is determined from the position at which the split alignment occurs. If no suitable split alignment is produced, the structural variation event is not reliable and is discarded without further analysis; such events are usually false positives caused by sequencing errors or numerous alignment errors. Once the breakpoint is obtained, a short sequence spanning 10 bases upstream and 10 bases downstream of the breakpoint is taken as a probe. When the frequency of the structural variation event is calculated later, the original BAM file is traversed again and more reads spanning the breakpoint are extracted from it. These reads may be the seed split-aligned reads that were used to call the event, or reads whose split part is so short that no split alignment was produced when the BAM file was generated; comparing such a read with the probe shows whether it supports the corresponding structural variation event, which gives a more accurate frequency. For events involving an insertion, two probes are retained, one containing the insertion and one without it.
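The probe step can be sketched as below. The text only says that reads are compared with the probe; exact substring matching on either strand is an assumption made here for simplicity, and the function names are illustrative.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def make_probe(conserved_seq, breakpoint_index, flank=10):
    """Take `flank` bases on each side of the breakpoint.

    breakpoint_index -- index of the first base downstream of the breakpoint
    """
    start, end = breakpoint_index - flank, breakpoint_index + flank
    if start < 0 or end > len(conserved_seq):
        raise ValueError("conserved sequence too short around the breakpoint")
    return conserved_seq[start:end]

def read_supports_event(read_seq, probe):
    """Assumed rule: the read supports the event if it contains the probe
    exactly, on either strand."""
    return probe in read_seq or probe in reverse_complement(read_seq)
```

Because the probe is only 20 bases long, a read with a short clipped tail that the aligner could not split-align can still be recognised as spanning the breakpoint.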
The reference sequence consists of two parts: one comes from the reference sequence spanning one breakpoint of the structural variation event and the other from the reference sequence spanning the other breakpoint. How the two parts are joined (which part is at the 5' end and which at the 3' end, and whether each part is joined as the forward sequence or as its reverse complement) is determined by the alignment of the split-aligned reads that support the event. Specifically:
Type I inversion: the upstream part comes from the reference sequence spanning the breakpoint with the smaller coordinate and the downstream part from the reference sequence spanning the breakpoint with the larger coordinate; the upstream part is joined in the forward direction and the downstream part as its reverse complement.
Type II inversion: the upstream part comes from the reference sequence spanning the breakpoint with the smaller coordinate and the downstream part from the reference sequence spanning the breakpoint with the larger coordinate; the upstream part is joined as its reverse complement and the downstream part in the forward direction.
Deletion: the upstream part comes from the reference sequence spanning the breakpoint with the smaller coordinate and the downstream part from the reference sequence spanning the breakpoint with the larger coordinate; both parts are joined in the forward direction.
Duplication: the upstream part comes from the reference sequence spanning the breakpoint with the larger coordinate and the downstream part from the reference sequence spanning the breakpoint with the smaller coordinate; both parts are joined in the forward direction.
Insertion: the reference sequence is taken directly from the plus-strand reference sequence spanning the insertion site.
Two chromosomes joined 5' end to 5' end: the upstream part comes from the reference sequence spanning the breakpoint on the chromosome with the larger number and the downstream part from the reference sequence spanning the breakpoint on the chromosome with the smaller number; the upstream part is joined in the forward direction and the downstream part as its reverse complement.
Two chromosomes joined 3' end to 3' end: the upstream part comes from the reference sequence spanning the breakpoint on the chromosome with the larger number and the downstream part from the reference sequence spanning the breakpoint on the chromosome with the smaller number; the upstream part is joined as its reverse complement and the downstream part in the forward direction.
5' end of the larger-numbered chromosome joined to the 3' end of the smaller-numbered chromosome: the upstream part comes from the reference sequence spanning the breakpoint on the chromosome with the larger number and the downstream part from the reference sequence spanning the breakpoint on the chromosome with the smaller number; both parts are joined in the forward direction.
3' end of the larger-numbered chromosome joined to the 5' end of the smaller-numbered chromosome: the upstream part comes from the reference sequence spanning the breakpoint on the chromosome with the larger number and the downstream part from the reference sequence spanning the breakpoint on the chromosome with the smaller number; both parts are joined in the forward direction.
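The joining rules for the four intrachromosomal types can be expressed as a small lookup table. The sketch below is illustrative only: the type labels, the `JOIN_RULES` table layout and the function names are assumptions, and the caller is assumed to supply the two flanking reference sequences already cut out around the smaller-coordinate and larger-coordinate breakpoints.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

# (upstream source, upstream reverse-complemented?,
#  downstream source, downstream reverse-complemented?)
JOIN_RULES = {
    "inversion_I":  ("small", False, "large", True),
    "inversion_II": ("small", True,  "large", False),
    "deletion":     ("small", False, "large", False),
    "duplication":  ("large", False, "small", False),
}

def build_junction_reference(sv_type, flank_small, flank_large):
    """Join two reference flanks according to the rule for `sv_type`.

    flank_small -- reference sequence taken at the smaller-coordinate breakpoint
    flank_large -- reference sequence taken at the larger-coordinate breakpoint
    """
    up_src, up_rc, down_src, down_rc = JOIN_RULES[sv_type]
    flanks = {"small": flank_small, "large": flank_large}
    upstream, downstream = flanks[up_src], flanks[down_src]
    if up_rc:
        upstream = reverse_complement(upstream)
    if down_rc:
        downstream = reverse_complement(downstream)
    return upstream + downstream
```

The interchromosomal cases would extend the same table with the larger-numbered and smaller-numbered chromosome flanks in place of the coordinate-based ones.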
After breakpoint refinement, some structural variation events are close to each other and of the same type. They come from the same structural variation event and appear as two events only because of sequencing or alignment errors; they are merged, and only the event with the higher support is retained. Structural variations supported by split-aligned reads are merged if the distance between their breakpoints is within the homologous sequence length of either event or is less than 10 bases. Structural variations supported by discordantly aligned read pairs are merged if they are of the same type and the distance between their breakpoints is within the maximum insert length, and the one with the higher support is retained.
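A minimal sketch of the merge rule for events supported by split-aligned reads follows. The dictionary keys (`sv_type`, `ref1`, `ref2`, `bp1`, `bp2`, `support`, `homology_len`) and the function names are hypothetical; requiring the same reference names and applying the distance rule to both breakpoints are assumptions of the example.

```python
def _same_event(a, b, max_dist=10):
    if (a["sv_type"], a["ref1"], a["ref2"]) != (b["sv_type"], b["ref1"], b["ref2"]):
        return False
    # "Homologous sequence length of either event": use the longer of the two.
    limit = max(a.get("homology_len", 0), b.get("homology_len", 0))

    def close(x, y):
        dist = abs(x - y)
        return dist <= limit or dist < max_dist

    return close(a["bp1"], b["bp1"]) and close(a["bp2"], b["bp2"])

def merge_similar_events(events, max_dist=10):
    """Keep, for each group of near-identical events, the best-supported one."""
    kept = []
    for ev in sorted(events, key=lambda e: e["support"], reverse=True):
        if not any(_same_event(ev, k, max_dist) for k in kept):
            kept.append(ev)
    return kept
```

Visiting events in order of decreasing support means the retained representative of each group is always the one with the higher support, as the text requires.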
A structural variation event determined by split-aligned reads is often also supported by discordantly aligned read pairs. In the earlier calculations of this embodiment the two kinds of event are handled separately; they are merged at this point to strengthen the evidence. The merging criterion is that the breakpoint of a structural variation event supported by split-aligned reads lies within the maximum insert length of the breakpoint of a structural variation event of the same type supported by discordantly aligned read pairs. Structural variation events supported by discordantly aligned read pairs that cannot be merged are retained as they are.
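This cross-evidence merge can be sketched as follows, under the same hypothetical event layout (dictionaries with `sv_type`, `ref1`, `ref2`, `bp1`, `bp2` and `support`). Recording the read-pair support in a separate `pair_support` field is a choice made for the example, not something the text prescribes.

```python
def add_pair_support(split_events, pair_events, max_insert):
    """Attach discordant-pair support to matching split-read events.

    Returns (merged split-read events, unmatched pair events); the unmatched
    pair events are kept unchanged, as described in the text.
    """
    merged = [dict(ev, pair_support=0) for ev in split_events]
    leftovers = []
    for pair_ev in pair_events:
        for ev in merged:
            if (ev["sv_type"] == pair_ev["sv_type"]
                    and ev["ref1"] == pair_ev["ref1"]
                    and ev["ref2"] == pair_ev["ref2"]
                    and abs(ev["bp1"] - pair_ev["bp1"]) <= max_insert
                    and abs(ev["bp2"] - pair_ev["bp2"]) <= max_insert):
                ev["pair_support"] += pair_ev["support"]
                break
        else:
            # No split-read event matched: keep the pair event as it is.
            leftovers.append(pair_ev)
    return merged, leftovers
```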
For structural variation events supported by split-aligned reads, some false positives come from reads in repetitive regions. It cannot be determined from which position of the genome such reads originate, and their conserved sequence can also align to multiple positions of the genome. Because only a local alignment is used when the breakpoint is first determined, its accuracy is limited. To refine the breakpoint position further, the conserved sequence is aligned globally to the reference genome, and structural variation events in repetitive regions are marked.
This alignment uses a BWT-based algorithm. If the conserved sequence can be matched completely and contiguously to a segment of the genome, or can produce multiple consistent alignments, the structural variation event is not a true structural variation event. The procedure is as follows: if the conserved sequence can be matched completely and contiguously to the reference genome sequence or the reference main transcript sequence, the group of records is marked and filtered; if the conserved sequence produces split alignments on the reference genome sequence or the reference main transcript sequence whose breakpoint positions are within 10 bp of the previously calculated breakpoint position, and there are more than 2 groups of such split alignments, the group of records is marked and filtered; if there is 1 group of such split alignments, the breakpoint is refined further according to its breakpoint position; and if there is less than 1 group, no marking or modification is performed.
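The decision rules above can be written as a small function. The names below are illustrative, and the alignment itself is assumed to have been done elsewhere. Note that the text does not say what happens when exactly 2 groups of split alignments are found, so the sketch reports that case as unspecified rather than guessing.

```python
def count_consistent_splits(split_breakpoints, expected_bp, window=10):
    """Count split alignments whose breakpoint lies within `window` bp of the
    previously calculated breakpoint."""
    return sum(1 for bp in split_breakpoints if abs(bp - expected_bp) <= window)

def repeat_filter_decision(fully_matched, n_consistent_splits):
    """Return 'filter', 'refine', 'keep' or 'unspecified' for one event."""
    if fully_matched:
        return "filter"       # contiguous full-length match: not a real event
    if n_consistent_splits > 2:
        return "filter"       # more than 2 groups: repetitive region
    if n_consistent_splits == 1:
        return "refine"       # unique split alignment: refine the breakpoint
    if n_consistent_splits < 1:
        return "keep"         # no marking or modification
    return "unspecified"      # exactly 2 groups is not covered by the text
```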
After the structural variation events with further refined breakpoints are obtained, their frequencies are calculated. Calculating the frequency of a structural variation event requires two counts: the reference-type count at the two breakpoints of the event and the variant-type count of the event. To avoid the inflated counts that result from the conventional practice of counting supporting reads, this embodiment counts molecules. The reference type is also counted by molecules, and the reference-type count with the higher support on the two sides of the breakpoint is used as the reference-type count of the event.
Whether a read or read pair crossing a breakpoint supports the reference type is determined according to the read classifications for the different structural variation types (see Table 1 and Table 2); reads and read pairs that do not support any variation are counted toward the reference type.
The frequency of a structural variation is equal to the number of molecules supporting the structural variation divided by the sum of the number of molecules supporting the structural variation and the number of molecules supporting the reference type.
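As a worked example of this formula, the sketch below takes molecule counts that are assumed to have been obtained already (how reads are collapsed into molecules is not detailed here) and uses the larger of the two reference-type counts, as described above. The function and parameter names are illustrative.

```python
def sv_frequency(variant_molecules, ref_molecules_side1, ref_molecules_side2):
    """Frequency = variant / (variant + reference), where the reference count
    is the larger of the two reference-type molecule counts."""
    reference = max(ref_molecules_side1, ref_molecules_side2)
    total = variant_molecules + reference
    if total == 0:
        return 0.0  # assumption: report 0 when there is no coverage at all
    return variant_molecules / total
```

For instance, 5 variant molecules with reference-type counts of 90 and 95 give 5 / (5 + 95) = 0.05.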
Example 3
The structural variation detection method supports the identification of both RNA and DNA structural variations, whereas prior-art methods rarely support both: DELLY and GeneFuse support only DNA structural variations, and TopHat-Fusion supports only RNA structural variations. To test the performance of the method of the invention, it was compared with the commercial software FusionMap, which supports the identification of both DNA and RNA structural variations.
269 positive samples, each experimentally confirmed to carry 1-2 fusions, were extracted from the sequencing data of Example 1. Structural variations were detected in the DNA library sequencing data with the method of Example 2 and with FusionMap. With the method of Example 2 the detection rate reached 100% (see Table 3): the positive fusions were detected well, with a very low false detection rate and very high specificity. FusionMap also detected the fusions in 100% of these samples but additionally reported other, unreliable fusions (on average more than 3 fusions per sample). The method of Example 2 therefore has higher specificity than FusionMap.
TABLE 3 Analysis results for positive samples with the detection method of the present invention
Number of test samples: 269
Detection rate: 100%
Fusions reported per sample: fewer than 2 on average
Further tests were performed on a total of 18 difficult samples with poor raw data quality and more complex libraries. All 18 were identified by the method of Example 2, and none was identified by FusionMap (see Table 4). The method of the present invention therefore detects samples with poor raw data quality or more complex libraries better than the commercial software FusionMap.
TABLE 4 Analysis results for complex or low-frequency positive samples
Number of test samples: 18
Samples detected by the method of Example 2: 18 (100% detection)
Samples detected by FusionMap: 0
The test scale was then enlarged: 8000 samples were extracted from the sequencing data of Example 1 and first analyzed with FusionMap, which missed credible fusions in 98 samples. These missed samples were then tested with the method of Example 2, and the fusions in all 98 were detected (Table 5).
TABLE 5 Analysis results for clinical samples
Number of test samples: 8000
Samples with credible fusions missed by FusionMap: 98
Missed samples detected by the method of Example 2: 98
In addition, the high sensitivity of the method of the present invention stems at least in part from its principle: when collecting initial evidence of a possible structural variation event, it can proceed with the support of only one split-aligned read or only one discordantly aligned read pair. By contrast, FusionMap requires at least 2 split-aligned reads to identify a structural variation, and DELLY, another commonly used open-source tool, requires at least 3 split-aligned reads or 3 discordantly aligned read pairs before it starts the structural variation calculation. The detection method of the present invention is therefore more sensitive than the conventional methods.
The tests above partly demonstrate the high specificity of the method of the present invention. The test scale was expanded further, and all 15000 samples of Example 1 were analyzed. The method of Example 2 reported 1-2 credible structural variations per sample, whereas the results of the commercial software FusionMap frequently contained 5 or more structural variations, and those of DELLY contained from several to hundreds. In general, 1-2 structural variations per sample are credible, and additional reported structural variations are often unreliable or clinically insignificant. This result indicates that the method of the present invention has higher specificity.
To test the speed and resource consumption of the method of the invention, the 15000 samples of Example 1 were analyzed with the method of Example 2. The average analysis time was 3 minutes per sample (see Table 6), with an average memory usage of 6 GB per sample and 8 threads per sample, whereas typical open-source or commercial software takes at least half an hour per sample and consumes more resources. The method of the present invention is therefore faster and consumes fewer resources than the prior art.
TABLE 6 Resource consumption analysis
Number of test samples: 15000
Average memory consumption: 6 GB/sample
Number of threads: 8/sample
Average analysis time: 3 minutes/sample
Finally, it is to be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for detecting structural variation, the method comprising:
1) aligning the sequencing data to a reference genome sequence or a reference main transcript sequence;
2) finding normally aligned reads, split-aligned reads and discordantly aligned read pairs;
3) classifying the split-aligned reads and the discordantly aligned read pairs;
4) clustering the different types of split-aligned reads and discordantly aligned read pairs separately, so that reads supporting the same structural variation event are placed in the same set;
5) for structural variation events determined by split-aligned reads, forming a conserved sequence by assembling the reads that support the structural variation event;
6) determining an accurate breakpoint position based on the conserved sequence;
7) merging the structural variations supported by split-aligned reads and merging the structural variations supported by discordantly aligned read pairs, wherein merging refers to combining structural variation events with similar breakpoints and the same type into one structural variation event;
8) merging structural variations supported by split-aligned reads with structural variations supported by discordantly aligned read pairs that have similar breakpoints and the same type;
9) removing structural variation events whose conserved sequence can be completely and contiguously matched to a segment of the genome or can produce multiple consistent alignments;
10) calculating the frequency of the structural variation events.
2. The detection method according to claim 1, wherein step 2) further comprises counting the insert lengths of the normally aligned reads and calculating the main parameters of the insert length distribution; the main parameters are preferably the maximum, the minimum and/or the mean.
3. The detection method according to any one of claims 1-2, wherein SA tags are used to find the split-aligned reads and/or the discordantly aligned read pairs.
4. The method according to any one of claims 1 to 3, wherein in step 8) the merging condition comprises that the breakpoint of a structural variation event supported by split-aligned reads is within the maximum insert length of the breakpoint of a structural variation event of the same type supported by discordantly aligned read pairs.
5. The detection method according to any one of claims 1 to 4, wherein in step 9), if the conserved sequence can be completely and contiguously matched to a segment of the genome, the group of records is marked and/or filtered; if the conserved sequence produces split alignments on the reference genome sequence or the reference main transcript sequence whose breakpoint positions are within 10 bp of the previously calculated breakpoint position, and there are more than 2 groups of such split alignments, the group of records is marked and/or filtered; if there is 1 group of such split alignments, the breakpoint is refined further according to its breakpoint position; and if there is less than 1 group, no marking and/or modification is performed.
6. The method according to any one of claims 1 to 5, wherein in step 10) the structural variation event frequency is calculated by counting the reference type and the variant type at the two breakpoints of the structural variation event, wherein the reference type is the type corresponding to the reference genome sequence or the reference main transcript sequence, and the variant type is the type corresponding to the structural variation event sequence; preferably, the reference-type count with the higher support on the two sides of the breakpoint is used as the reference-type count of the structural variation event.
7. The detection method according to claim 6, wherein the structural variation event frequency is the number of molecules supporting structural variation/(number of molecules supporting structural variation + number of molecules supporting reference type).
8. A system or apparatus for detecting structural variations, the system or apparatus comprising:
1) a sequencing module and/or a sequencing data acquisition module; and
2) an identification module for performing a method of detecting a structural variation according to any one of claims 1-7 on sequencing data.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program containing a program for executing the method for detecting a structural variation according to any one of claims 1 to 7.
10. An apparatus comprising a processor, a memory, and a computer program stored in the memory, the computer program comprising a program for performing the method of detecting a structural variation according to any one of claims 1-7.
CN202010098320.0A 2020-02-18 2020-02-18 Structural variation detection method Active CN111326212B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010098320.0A CN111326212B (en) 2020-02-18 2020-02-18 Structural variation detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010098320.0A CN111326212B (en) 2020-02-18 2020-02-18 Structural variation detection method

Publications (2)

Publication Number Publication Date
CN111326212A true CN111326212A (en) 2020-06-23
CN111326212B CN111326212B (en) 2023-06-23

Family

ID=71172130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010098320.0A Active CN111326212B (en) 2020-02-18 2020-02-18 Structural variation detection method

Country Status (1)

Country Link
CN (1) CN111326212B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013059967A1 (en) * 2011-10-28 2013-05-02 深圳华大基因科技有限公司 Method for detecting micro-deletion and micro-repetition of chromosome
CN104298892A (en) * 2014-09-18 2015-01-21 天津诺禾致源生物信息科技有限公司 Detection device and method for gene fusion
CN106909806A (en) * 2015-12-22 2017-06-30 广州华大基因医学检验所有限公司 The method and apparatus of fixed point detection variation
CN110010193A (en) * 2019-05-06 2019-07-12 西安交通大学 A kind of labyrinth mutation detection method based on mixed strategy

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111863135A (en) * 2020-07-15 2020-10-30 西安交通大学 False positive structure variation filtering method, storage medium and computing device
CN111863135B (en) * 2020-07-15 2022-06-07 西安交通大学 False positive structure variation filtering method, storage medium and computing device
WO2022054178A1 (en) * 2020-09-09 2022-03-17 株式会社日立ハイテク Method and device for detecting structural mutation of individual genome
CN112669902A (en) * 2021-03-16 2021-04-16 北京贝瑞和康生物技术有限公司 Method, computing device and storage medium for detecting genomic structural variation
CN112669902B (en) * 2021-03-16 2021-06-04 北京贝瑞和康生物技术有限公司 Method, computing device and storage medium for detecting genomic structural variation
CN114005490A (en) * 2021-12-30 2022-02-01 北京优迅医疗器械有限公司 Circulating tumor DNA fusion detection method based on second-generation sequencing technology
CN114005490B (en) * 2021-12-30 2022-04-22 北京优迅医疗器械有限公司 Circulating tumor DNA fusion detection method based on second-generation sequencing technology
CN114464252A (en) * 2022-01-26 2022-05-10 深圳吉因加医学检验实验室 Method and device for detecting structural variation
CN114627967A (en) * 2022-03-15 2022-06-14 北京基石生命科技有限公司 Method for accurately annotating three-generation full-length transcript

Also Published As

Publication number Publication date
CN111326212B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN111326212B (en) Structural variation detection method
CN109767810B (en) High-throughput sequencing data analysis method and device
CN111341383B (en) Method, device and storage medium for detecting copy number variation
CN112111565A (en) Mutation analysis method and device for cell free DNA sequencing data
WO2023115662A1 (en) Method for detecting variant nucleic acids
WO2021232388A1 (en) Method for determining base type of predetermined site in embryonic cell chromosome, and application thereof
CN110993023B (en) Detection method and detection device for complex mutation
CN110060733B (en) Second-generation sequencing tumor somatic variation detection device based on single sample
CN116312780B (en) Method, terminal and medium for detecting somatic mutation of targeted gene second-generation sequencing data
CN111243663B (en) Gene variation detection method based on pattern growth algorithm
CN110021346A (en) Gene Fusion and mutation detection methods and system based on RNAseq data
WO2018218787A1 (en) Third-generation sequencing sequence correction method based on local graph
CN116013419A (en) Method for detecting chromosome copy number variation
CN113035273A (en) Rapid and ultrahigh-sensitivity DNA fusion gene detection method
CN111584006A (en) Circular RNA identification method based on machine learning strategy
CN113096737B (en) Method and system for automatically analyzing pathogen type
CN111696622B (en) Method for correcting and evaluating detection result of mutation detection software
CN117275577A (en) Algorithm for detecting human mitochondrial genetic mutation sites based on second-generation sequencing technology
CN112102944A (en) NGS-based brain tumor molecular diagnosis analysis method
CN116665775A (en) Method, device and storage medium for detecting mitochondrial origin nuclear genome sequence
CN114067908B (en) Method, device and storage medium for evaluating single-sample homologous recombination defects
CN114627967A (en) Method for accurately annotating three-generation full-length transcript
WO2022087839A1 (en) Non-invasive prenatal genetic testing data-based kinship determining method and apparatus
CN117577182B (en) System for rapidly identifying drug identification sites and application thereof
CN114093428B (en) System and method for detecting low-abundance mutation under ctDNA ultrahigh sequencing depth

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant