WO2023185559A1 - Method, device and storage medium for structural variation detection - Google Patents

Method, device and storage medium for structural variation detection

Info

Publication number
WO2023185559A1
WO2023185559A1 PCT/CN2023/082917 CN2023082917W WO2023185559A1 WO 2023185559 A1 WO2023185559 A1 WO 2023185559A1 CN 2023082917 W CN2023082917 W CN 2023082917W WO 2023185559 A1 WO2023185559 A1 WO 2023185559A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
breakpoint
signals
insert size
reads
Prior art date
Application number
PCT/CN2023/082917
Other languages
English (en)
French (fr)
Inventor
刘涛
何俊义
苏亚男
李敏
吴永鑫
Original Assignee
深圳吉因加医学检验实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳吉因加医学检验实验室
Publication of WO2023185559A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50: Mutagenesis
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10: Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Definitions

  • The present application relates to the technical field of bioinformatics, and in particular to a method, device and storage medium for structural variation detection.
  • Structural variation includes deletions, insertions, inversions, duplications, and translocations within the genome, as well as complex structural variations composed of these simple types.
  • Structural variant detection based on next-generation sequencing generally relies on the following strategies: paired-end mapping (PEM, also known as DP), split reads (SR), depth of coverage (DoC), and assembly.
  • Most current mainstream detection methods are based on one of these strategies or a combination of several. For example, when detecting fusion breakpoints, many methods cluster SR signals; one such method is BreakSeek, an indel breakpoint detection algorithm based on a Bayesian model. At high depth its iteration time becomes longer, and at low depth its accuracy is strongly affected.
  • The biggest flaw of traditional structural variation detection methods based on second-generation sequencing is their poor performance in identifying large or very large structural variations. Most methods can only detect structural variations within a few thousand bp, and their detection ability becomes worse for structural variations that exceed the insert size.
  • The depth-difference-based detection method inGap-sv identifies structural variations through DP, SR, SU and the number of normal read pairs, and uses depth information to correct the results.
  • Second-generation sequencing technology dominates the market today and will continue to do so for a long time; therefore, accurate breakpoint detection and the identification of large-scale and inter-chromosomal structural variations from second-generation sequencing data remain the focus and the difficulty of research in this field.
  • The purpose of this application is to provide a new method, device and storage medium for structural variation detection.
  • A first aspect of this application discloses a method for structural variation detection, including the following steps:
  • The data acquisition step includes obtaining the comparison (alignment) file of the second-generation sequencing data of the object to be tested and its basic information.
  • The basic information includes insert size mean and standard deviation, insert size max, and read length;
  • The signal classification step includes extracting reads from the comparison file in intervals of the set length, and dividing the abnormal reads into DP signals, SR signals and SU signals;
  • The DP signal refers to read pairs with insert size > insert size max, or whose two paired reads fall on two different chromosomes.
  • The SR signal refers to reads that are soft-clipped.
  • The SU signal refers to reads from read pairs in which only one of the two reads aligns to the reference sequence;
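  • The three signal definitions above can be sketched as a small classifier. This is a minimal illustration under stated assumptions, not the implementation of this application: the `ReadPair` record and its field names are hypothetical stand-ins for values that would normally be read from the bam file, and letting one read carry several labels at once is also an assumption.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class ReadPair:
    """Hypothetical, simplified view of one read and its mate."""
    chrom: str          # chromosome of this read
    mate_chrom: str     # chromosome of the mate
    insert_size: int    # absolute insert size of the pair
    soft_clipped: bool  # True if the alignment of this read is soft-clipped
    mapped: bool        # this read aligns to the reference
    mate_mapped: bool   # the mate aligns to the reference


def classify_signals(pair: ReadPair, insert_size_max: float) -> Set[str]:
    """Return the abnormal signal labels (DP, SR, SU) carried by a read."""
    signals = set()
    both_mapped = pair.mapped and pair.mate_mapped
    # DP: insert size beyond the normal range, or mates on different chromosomes.
    if both_mapped and (pair.chrom != pair.mate_chrom
                        or pair.insert_size > insert_size_max):
        signals.add("DP")
    # SR: the read is soft-clipped.
    if pair.mapped and pair.soft_clipped:
        signals.add("SR")
    # SU: exactly one read of the pair aligns to the reference.
    if pair.mapped != pair.mate_mapped:
        signals.add("SU")
    return signals
```

  • A read with an empty label set is treated as a normal read in this sketch.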
  • The DP signal cluster analysis step includes clustering the DP signals obtained in the signal classification step, treating reads with similar positions and the same direction as one DP signal cluster, and each cluster as a candidate structural variation; reads with similar positions are reads whose distance is within the insert size max range, that is, within the normal insert size range;
  • The fusion breakpoint analysis step includes extracting the SR signals and SU signals within the insert size max range of each cluster obtained in the DP signal cluster analysis step, adding the corresponding DP signals for assembly, and re-aligning the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences;
  • The SR signals and SU signals are extracted from within the insert size max range of each cluster mainly because the breakpoint information from the preliminary analysis of each cluster is recorded in the DP clustering result: if there is an SR signal at the beginning or end of the cluster, it is set as the left or right breakpoint.
  • The start and end positions of the DP cluster are determined, so the flank in that direction is insert size max - 2 × read length, which avoids fetching redundant signals as far as possible when fetching SR and SU signals; if the breakpoint on one side of the DP cluster is determined by SR, the flank on that side is set to 10 bp, because SRs shorter than 5 bp are considered untrustworthy during SR filtering;
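  • The clustering rule (same direction, distance within insert size max) can be sketched as a greedy pass over sorted reads. This is an illustrative sketch only: the tuple layout `(chromosome, position, strand)` is hypothetical, and measuring the distance from the previous read in the cluster, rather than from the cluster start, is an assumption that the text does not settle.

```python
from typing import List, Tuple

DPRead = Tuple[str, int, str]  # (chromosome, position, strand), hypothetical layout


def cluster_dp_reads(reads: List[DPRead], insert_size_max: int) -> List[List[DPRead]]:
    """Group DP reads that share chromosome and strand and lie close together."""
    clusters: List[List[DPRead]] = []
    # Sorting by chromosome, strand, then position places cluster members side by side.
    for read in sorted(reads, key=lambda r: (r[0], r[2], r[1])):
        if clusters:
            last = clusters[-1][-1]
            same_group = last[0] == read[0] and last[2] == read[2]
            # Assumption: "similar position" is measured from the previous read.
            if same_group and read[1] - last[1] <= insert_size_max:
                clusters[-1].append(read)
                continue
        clusters.append([read])
    return clusters
```

  • Each returned cluster would then be treated as one candidate structural variation.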
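  • The flank rule can be written out as a short calculation. The sketch below is an assumption-laden illustration: the function names are hypothetical, the flank is assumed to extend outward from the cluster boundaries, and clamping at zero is a guard added here rather than something stated in the text.

```python
from typing import Tuple


def side_flank(insert_size_max: int, read_length: int, determined_by_sr: bool) -> int:
    """Flank (bp) used to fetch SR and SU signals on one side of a DP cluster."""
    if determined_by_sr:
        # The breakpoint on this side is already fixed by an SR signal.
        return 10
    # Assumption: clamp at zero so the flank is never negative.
    return max(insert_size_max - 2 * read_length, 0)


def fetch_window(start: int, end: int, insert_size_max: int, read_length: int,
                 left_by_sr: bool, right_by_sr: bool) -> Tuple[int, int]:
    """Extend a DP cluster [start, end] outward by the flank of each side."""
    left = side_flank(insert_size_max, read_length, left_by_sr)
    right = side_flank(insert_size_max, read_length, right_by_sr)
    return max(start - left, 0), end + right
```

  • For example, with insert size max 600 and read length 150, a side without an SR-determined breakpoint gets a flank of 300 bp.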
  • The SR signal analysis step includes searching for chimeric alignments (SA signals) among the SR signals obtained in the signal classification step to obtain mutations that do not contain DP signals, extracting the corresponding DP signals and SU signals near the region where the mutation occurs, adding the corresponding reference sequence near this region for assembly, and re-aligning the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences. Obtaining mutations that do not contain DP signals has two meanings: first, mutations that lack DP signals because the mutated sequence is short; second, other special mutations that lack DP signals. Some short mutations do contain DP signals, long mutations do not necessarily all contain DP signals, and some mutations have their own specificities; this application therefore uses SR signal analysis to retrieve smaller and special structural variations;
  • The shorter sequence means that the mutated sequence is shorter.
  • Insert sizes within insert size mean + 3.96 × insert size standard deviation are considered to be in the normal range. Read pairs with such normal insert sizes may still contain mutation signals; for example, mutations on the scale of the read length may not generate enough DP signals. SR signal analysis makes up for the shortcomings of DP-only detection and finds small SVs within the insert size max range;
  • The range of this flank is chosen to find enough abnormal signals, so that usable consensus sequences are more likely to be assembled on both the left and right sides;
  • The calculation and annotation step includes, for the results of the fusion breakpoint analysis step and the SR signal analysis step, calculating the mutation depth on the left and right sides of the fusion breakpoint, identifying the structural variation type, and annotating each result from the left and right breakpoints left_bp and right_bp and the alignment directions of the left and right assembled fragments;
  • The annotation result merging and output step includes merging the annotation results of the calculation and annotation step, so as to combine the overlapping information produced by the dual recognition of the DP signal and the SR signal, and using the merged result as the structural variation detection result of the object to be tested.
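  • In SAM/BAM files, chimeric alignments of a read are conventionally recorded in the `SA` tag as semicolon-terminated entries of the form `rname,pos,strand,CIGAR,mapQ,NM`. The sketch below parses such a tag value into records; it assumes the tag string has already been read from the bam file, and the `Supplementary` type and function name are hypothetical. Whether this application reads the tag in exactly this way is not stated.

```python
from typing import List, NamedTuple


class Supplementary(NamedTuple):
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost position, as in the SAM format
    strand: str  # "+" or "-"
    cigar: str
    mapq: int
    nm: int      # edit distance


def parse_sa_tag(sa_value: str) -> List[Supplementary]:
    """Parse the value of an SA tag into one record per chimeric alignment."""
    records = []
    for entry in sa_value.split(";"):
        if not entry:
            continue  # the tag value ends with a trailing semicolon
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        records.append(Supplementary(rname, int(pos), strand, cigar, int(mapq), int(nm)))
    return records
```

  • A soft-clipped read with at least one parsed record would be a candidate for the SR signal analysis described above.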
  • After extracting the abnormal signals, the structural variation detection method of this application performs cluster analysis on the DP signals, then assembles and re-aligns the clustering results. When retrieving SR signals, it focuses on SR signals with chimeric alignments, so that mutations with weak DP signals, or special mutations without any DP signal, can still be accurately captured and locally assembled. Finally, the annotation part identifies microhomology sequences, including small fragment insertions and short tandem repeats; even where the breakpoint is ambiguous, the most likely fusion breakpoint is given, together with the base sequence that causes the ambiguity.
  • The parallel design of this application is also a highlight, especially the parallelization of the steps that process larger amounts of data, which ensures operational efficiency as well as accuracy.
  • The method of this application identifies structural variations with high precision, high efficiency and a wide identification range, and provides a new solution and approach for structural variation detection.
  • The comparison file is a bam file.
  • Insert size max is insert size mean + 3.96 × insert size standard deviation.
  • The set length is 75 kbp.
  • The set length of 75 kbp in the signal classification step is not fixed; in practice, this application found that dividing the chromosome region into 75 kbp blocks lets the parallel module make full use of computer resources.
  • This set length serves as the parallel processing interval for separately extracting the three types of signals, and can be set as needed.
  • The recommended and default value is 75 kbp.
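  • The threshold formula can be evaluated directly. In the sketch below the function names are hypothetical, and using the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say how the standard deviation is estimated.

```python
import statistics
from typing import Sequence, Tuple


def summarize_insert_sizes(insert_sizes: Sequence[int]) -> Tuple[float, float]:
    """Return (mean, standard deviation) of the observed insert sizes."""
    # Assumption: the sample standard deviation is used.
    return statistics.mean(insert_sizes), statistics.stdev(insert_sizes)


def compute_insert_size_max(mean: float, sd: float, k: float = 3.96) -> float:
    """insert size max = insert size mean + 3.96 x insert size standard deviation."""
    return mean + k * sd
```

  • For example, a library with mean 300 and standard deviation 50 gives an insert size max of 300 + 3.96 × 50 = 498.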
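  • Splitting a chromosome into blocks of the set length and processing them in parallel can be sketched as follows. The half-open, 0-based interval convention, the function names and the use of a thread pool are all assumptions made for brevity; the text does not state how the parallel module is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple, TypeVar

Block = Tuple[int, int]  # half-open interval [start, end), 0-based (assumed convention)
T = TypeVar("T")


def split_into_blocks(chrom_length: int, block_size: int = 75_000) -> List[Block]:
    """Cut a chromosome of the given length into consecutive blocks."""
    return [(start, min(start + block_size, chrom_length))
            for start in range(0, chrom_length, block_size)]


def process_blocks(blocks: List[Block], worker: Callable[[Block], T],
                   max_workers: int = 4) -> List[T]:
    """Apply `worker` to every block in parallel, keeping block order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, blocks))
```

  • In practice `worker` would extract the DP, SR and SU signals of one block; here it is left as a caller-supplied function.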
  • The left and right sides of the fusion breakpoint refer to the left side of the left breakpoint and the right side of the right breakpoint.
  • The numbers of DP signals, SR signals and SU signals contained in the left and right consensus sequences are taken respectively as the alt depths; the larger of the two alt depths is taken as the mutation depth, and the number of DP signals, SR signals, SU signals and normal reads in the corresponding interval is taken as the overall depth.
  • The left and right sides of the fusion breakpoint do not have a specific value range either; they refer to the left side of the left breakpoint and the right side of the right breakpoint.
  • The lengths of these two areas are not fixed, because they depend on the length of the consensus sequence in the assembly result.
  • In one implementation of this application, the number of DP+SR+SU signals contained in the left and right consensus sequences is taken as the alt depth, the larger of the two depths as the mutation depth, and the number of DP+SR+SU+normal reads in the corresponding interval as the overall depth, so that the mutation frequency can be calculated.
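  • The depth and frequency calculation can be illustrated with a small worked example. The `SideCounts` record is hypothetical, and two simplifications are assumed: the "corresponding interval" is taken to be the side with the larger alt depth, and the abnormal-signal counts of the interval are taken to equal those of the consensus sequence.

```python
from dataclasses import dataclass


@dataclass
class SideCounts:
    """Hypothetical read counts on one side of a fusion breakpoint."""
    dp: int
    sr: int
    su: int
    normal: int

    @property
    def alt_depth(self) -> int:
        # Number of DP + SR + SU signals.
        return self.dp + self.sr + self.su

    @property
    def overall_depth(self) -> int:
        # Number of DP + SR + SU + normal reads.
        return self.alt_depth + self.normal


def mutation_frequency(left: SideCounts, right: SideCounts) -> float:
    """Mutation depth (larger alt depth) divided by the overall depth of that side."""
    side = left if left.alt_depth >= right.alt_depth else right
    if side.overall_depth == 0:
        return 0.0
    return side.alt_depth / side.overall_depth
```

  • With 8 abnormal reads and 92 normal reads on the deeper side, the mutation depth is 8, the overall depth is 100, and the mutation frequency is 8%.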
  • In the calculation and annotation step, annotating each result specifically includes identifying the type of structural variation from the two alignment directions and the relative positions of breakpoint 1 and breakpoint 2. If the left and right breakpoints are not on the same chromosome, the variation is an inter-chromosomal translocation: type 2 if the left and right sequences have the same direction, type 1 if they do not. If the left and right breakpoints are on the same chromosome and the left and right sequences have the same alignment direction, it is a chromosomal inversion; if breakpoint 1 is before breakpoint 2 and breakpoint 1 is a reverse alignment, or breakpoint 1 is after breakpoint 2 and breakpoint 2 is a reverse alignment, it is a chromosomal deletion; the rest are chromosomal duplications.
  • When arranging breakpoints, the left and right breakpoints are determined by their relative positions; breakpoint 1 is therefore always to the left of breakpoint 2, that is, at the smaller coordinate, and if the two are on different chromosomes, the smaller chromosome number comes first and the larger one second. In other words, breakpoint 1 in this application is the left breakpoint, and breakpoint 2 is the right breakpoint.
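  • The decision rules for the variation type can be transcribed into a short function. This is a literal sketch of the rules as worded above, not the implementation of this application: the `Breakpoint` record is hypothetical, the rules are assumed to be tested in the order given, and the breakpoints are assumed to be already ordered as breakpoint 1 and breakpoint 2.

```python
from typing import NamedTuple


class Breakpoint(NamedTuple):
    chrom: str
    pos: int
    reverse: bool  # True if the assembled fragment on this side aligns in reverse


def classify_sv(bp1: Breakpoint, bp2: Breakpoint) -> str:
    """Identify the structural variation type from directions and positions."""
    same_direction = bp1.reverse == bp2.reverse
    if bp1.chrom != bp2.chrom:
        # Different chromosomes: inter-chromosomal translocation, type by direction.
        if same_direction:
            return "inter-chromosomal translocation type 2"
        return "inter-chromosomal translocation type 1"
    if same_direction:
        return "inversion"
    if (bp1.pos < bp2.pos and bp1.reverse) or (bp1.pos > bp2.pos and bp2.reverse):
        return "deletion"
    return "duplication"
```

  • For instance, two breakpoints on the same chromosome with opposite directions, where breakpoint 1 lies first and is a reverse alignment, are reported as a deletion.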
  • The second aspect of this application discloses a device for structural variation detection, including a data acquisition module, a signal classification module, a DP signal cluster analysis module, a fusion breakpoint analysis module, an SR signal analysis module, a calculation and annotation module, and an annotation result merging and output module; details are as follows:
  • The data acquisition module is used to obtain the comparison file of the second-generation sequencing data of the object to be tested and its basic information.
  • The basic information includes insert size mean and standard deviation, insert size max, and read length;
  • The signal classification module is used to extract reads from the comparison file in intervals of the set length, and to divide the abnormal reads into DP signals, SR signals and SU signals;
  • The DP signal refers to read pairs with insert size > insert size max, or whose two paired reads fall on two different chromosomes.
  • The SR signal refers to reads that are soft-clipped.
  • The SU signal refers to reads from read pairs in which only one of the two reads aligns to the reference sequence;
  • The DP signal cluster analysis module is used to cluster the DP signals obtained by the signal classification module; reads whose distance is within the insert size max range and whose direction is the same are treated as one DP signal cluster, and each cluster serves as a candidate structural variation;
  • The fusion breakpoint analysis module is used to extract the SR signals and SU signals within the insert size max range of each cluster obtained by the DP signal cluster analysis module, add the corresponding DP signals for assembly, and re-align the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences;
  • The SR signal analysis module is used to search for chimeric alignments among the SR signals obtained by the signal classification module to obtain mutations that do not contain DP signals; to extract the corresponding DP signals and SU signals near the region where the mutation occurs, that is, within the insert size range on both sides of the SR signal interval; to add the corresponding reference sequence near the region, that is, at least 10 bp on both sides of the SR signal interval, for assembly; and to re-align the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences;
  • The calculation and annotation module is used to calculate, for the results of the fusion breakpoint analysis module and the SR signal analysis module, the mutation depth on the left and right sides of the fusion breakpoint, to identify the structural variation type, and to annotate each result from the left and right breakpoints left_bp and right_bp and the alignment directions of the left and right assembled fragments;
  • The annotation result merging and output module is used to merge the annotation results of the calculation and annotation module, so as to combine the overlapping information produced by the dual recognition of the DP signal and the SR signal, and to use the merged result as the structural variation detection result of the object to be tested.
  • The device for structural variation detection of this application implements each step of the method for structural variation detection of this application through its modules; therefore, for the specific limitations of each module, refer to the method for structural variation detection of this application, which are not repeated here.
  • For example, the comparison file and insert size max in the data acquisition module, the set length in the signal classification module, and the way each result is annotated in the calculation and annotation module can all follow the method for structural variation detection of this application.
  • The third aspect of the present application discloses a device for structural variation detection.
  • The device includes a memory and a processor; the memory is used to store a program; the processor is used to execute the program stored in the memory to implement the method for structural variation detection of the present application.
  • The fourth aspect of the present application discloses a computer-readable storage medium, which stores a program, and the program can be executed by a processor to implement the method for structural variation detection of the present application.
  • The method and device for structural variation detection of this application use DP signal clustering, combined with subsequent local assembly and re-alignment, to effectively reduce false positive signals within the cluster and to obtain accurate fusion breakpoints on both sides of the structural variation together with the base sequences on both sides of each breakpoint; SR signal analysis then supplements the DP-based detection results, so that the overall results achieve a higher detection rate and accuracy.
  • The structural variation detection method of the present application can identify a variety of structural variation types, including deletions, inversions, duplications, intrachromosomal translocations and interchromosomal translocations, and outputs the microhomology sequences and short template sequences near breakpoints.
  • Figure 1 is a flow chart of the structural variation detection method in the embodiment of the present application.
  • Figure 2 is a structural block diagram of a structural variation detection device in an embodiment of the present application.
  • Figure 3 is a schematic diagram of the left and right side clustering process in the DP signal clustering process in the embodiment of the present application.
  • This application is a method to assist in identifying cancer hotspot fusions; it accepts WES or panel data from various plasma and tissue samples that were captured and sequenced with gene chips.
  • The bam file obtained by preprocessing the raw sequencer output and the corresponding chip capture interval are used as input, and the chip capture interval is used for calling SVs.
  • The chip capture interval covers the hotspot mutation intervals of various cancer types.
  • The capture depth within the interval can reach thousands or even tens of thousands, which provides good sample sequence information within the interval while eliminating the impact of false positive sequences during detection.
  • Although the capture interval is the focus of structural variation analysis, this application still detects and analyzes structural variation at the whole-genome level, so as not to miss potential variation signals and reads that fall outside the interval.
  • This application is based on re-alignment of target interval sequences to discover hotspot fusion breakpoints and the hotspot gene sequences on both sides of the breakpoints, while identifying as many microhomology sequences and short template sequence insertions on both sides of the breakpoints as possible.
  • The method for structural variation detection of this application includes data acquisition step 11, signal classification step 12, DP signal cluster analysis step 13, fusion breakpoint analysis step 14, SR signal analysis step 15, calculation and annotation step 16, and annotation result merging and output step 17.
  • Data acquisition step 11 includes obtaining the comparison file of the second-generation sequencing data of the object to be tested and its basic information.
  • The basic information includes insert size mean and standard deviation, insert size max, and read length.
  • The comparison file is a bam file, and insert size max is insert size mean + 3.96 × insert size standard deviation.
  • Signal classification step 12 includes extracting reads from the comparison file in intervals of the set length, and dividing the abnormal reads into DP signals, SR signals and SU signals;
  • The DP signal refers to read pairs with insert size > insert size max, or whose two paired reads fall on two different chromosomes.
  • The SR signal refers to reads that are soft-clipped.
  • The SU signal refers to reads from read pairs in which only one of the two reads aligns to the reference sequence.
  • Reads are extracted from the bam file in parallel, in intervals of 75 kbp.
  • DP signal cluster analysis step 13 includes clustering the DP signals obtained in the signal classification step, treating reads whose distance is within the insert size max range and whose direction is the same as one DP signal cluster, and each cluster as a candidate structural variation.
  • Fusion breakpoint analysis step 14 includes extracting the SR signals and SU signals within the insert size max range of each cluster obtained in the DP signal cluster analysis step, adding the corresponding DP signals for assembly, and re-aligning the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences.
  • SR signal analysis step 15 includes searching for chimeric alignments among the SR signals obtained in the signal classification step to obtain mutations that do not contain DP signals; extracting the corresponding DP signals and SU signals near the region where the mutation occurs, that is, within the insert size range on both sides of the SR signal interval; adding the corresponding reference sequence near the region, that is, at least 10 bp on both sides of the SR signal interval, for assembly; and re-aligning the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences.
  • Calculation and annotation step 16 includes, for the results of the fusion breakpoint analysis step and the SR signal analysis step, calculating the mutation depth on the left and right sides of the fusion breakpoint, identifying the structural variation type, and annotating each result from the left and right breakpoints left_bp and right_bp and the alignment directions of the left and right assembled fragments.
  • The type of structural variation is determined from the two alignment directions and the relative positions of breakpoint 1 and breakpoint 2. If the left and right breakpoints are not on the same chromosome, it is an inter-chromosomal translocation: type 2 if the left and right sequence directions are consistent, type 1 if they are not. If the left and right breakpoints are on the same chromosome and the left and right sequence alignment directions are consistent, it is a chromosomal inversion; if breakpoint 1 is before breakpoint 2 and breakpoint 1 is a reverse alignment, or breakpoint 1 is after breakpoint 2 and breakpoint 2 is a reverse alignment, it is a chromosomal deletion; the rest are chromosomal duplications.
  • Annotation result merging and output step 17 includes merging the annotation results of the calculation and annotation step, so as to combine the overlapping information produced by the dual recognition of the DP signal and the SR signal, and using the merged result as the structural variation detection result of the object to be tested.
  • The program can be stored in a computer-readable storage medium.
  • The storage medium can include read-only memory, random access memory, magnetic disk, optical disk, hard disk, etc.; the above functions are achieved by a computer executing the program.
  • The program is stored in the memory of the device, and when the program in the memory is executed by the processor, all or part of the above functions can be realized.
  • The program can also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk or a mobile hard disk, and be downloaded or copied into the memory of the local device, or used to update the version of the local device's system.
  • This application proposes a device for structural variation detection, as shown in Figure 2, including a data acquisition module 21, a signal classification module 22, a DP signal cluster analysis module 23, a fusion breakpoint analysis module 24, an SR signal analysis module 25, a calculation and annotation module 26, and an annotation result merging and output module 27.
  • The data acquisition module 21 is used to obtain the comparison file of the second-generation sequencing data of the object to be tested and its basic information.
  • The basic information includes insert size mean and standard deviation, insert size max, and read length.
  • The insert size max is insert size mean + 3.96 × insert size standard deviation.
  • The signal classification module 22 is used to extract reads from the comparison file in intervals of the set length, and to divide the abnormal reads into DP signals, SR signals and SU signals; the DP signal refers to read pairs with insert size > insert size max, or whose two paired reads fall on two different chromosomes.
  • The SR signal refers to reads that are soft-clipped.
  • The SU signal refers to reads from read pairs in which only one of the two reads aligns to the reference sequence.
  • The DP signal cluster analysis module 23 is used to cluster the DP signals obtained by the signal classification module; reads whose distance is within the insert size max range and whose direction is the same are treated as one DP signal cluster, and each cluster serves as a candidate structural variation.
  • The fusion breakpoint analysis module 24 is used to extract the SR signals and SU signals within the insert size max range of each cluster obtained by the DP signal cluster analysis module, add the corresponding DP signals for assembly, and re-align the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences.
  • The SR signal analysis module 25 is used to search for chimeric alignments among the SR signals obtained by the signal classification module to obtain mutations that do not contain DP signals; to extract the corresponding DP signals and SU signals near the region where the mutation occurs, that is, within the insert size range on both sides of the SR signal interval; to add the corresponding reference sequence near the region, that is, at least 10 bp on both sides of the SR signal interval, for assembly; and to re-align the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences.
  • The calculation and annotation module 26 is used to calculate, for the results of the fusion breakpoint analysis module and the SR signal analysis module, the mutation depth on the left and right sides of the fusion breakpoint, to identify the structural variation type, and to annotate each result from the left and right breakpoints left_bp and right_bp and the alignment directions of the left and right assembled fragments.
  • The annotation result merging and output module 27 is used to merge the annotation results of the calculation and annotation module, so as to combine the overlapping information produced by the dual recognition of the DP signal and the SR signal, and to use the merged result as the structural variation detection result of the object to be tested.
  • The device includes a memory and a processor; the memory is configured to store a program; the processor is configured to execute the program stored in the memory to implement the following method: a data acquisition step, including obtaining the comparison file of the second-generation sequencing data of the object to be tested and its basic information, where the basic information includes insert size mean and standard deviation, insert size max, and read length; a signal classification step, including extracting reads from the comparison file in intervals of the set length, and dividing the abnormal reads into DP signals, SR signals and SU signals, where the DP signal refers to read pairs with insert size > insert size max or whose two paired reads fall on two different chromosomes, the SR signal refers to reads that are soft-clipped, and the SU signal refers to reads from read pairs in which only one of the two reads aligns to the reference sequence;
  • a DP signal cluster analysis step, including clustering the DP signals obtained in the signal classification step, treating reads whose distance is within the insert size max range and whose direction is the same as one DP signal cluster, and each cluster as a candidate structural variation;
  • a fusion breakpoint analysis step, including extracting the SR signals and SU signals within the insert size max range of each cluster obtained in the DP signal cluster analysis step, adding the corresponding DP signals for assembly, and re-aligning the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences;
  • an SR signal analysis step, including searching for chimeric alignments among the SR signals obtained in the signal classification step to obtain mutations that do not contain DP signals;
  • a calculation and annotation step, including, for the results of the fusion breakpoint analysis step and the SR signal analysis step, calculating the mutation depth on the left and right sides of the fusion breakpoint, identifying the structural variation type, and annotating each result from the left and right breakpoints left_bp and right_bp and the alignment directions of the left and right assembled fragments;
  • an annotation result merging and output step, including merging the annotation results of the calculation and annotation step, so as to combine the overlapping information produced by the dual recognition of the DP signal and the SR signal, and using the merged result as the structural variation detection result of the object to be tested.
  • In another implementation, the storage medium includes a program that can be executed by a processor to implement the following method.
  • Data acquisition step: obtain the alignment file of the next-generation sequencing data of the subject under test and its basic information; the basic information includes the insert size mean and standard deviation, insert size max, and read length.
  • Signal classification step: extract the reads within intervals of a set length from the alignment file, and divide the abnormal reads into DP, SR and SU signals. A DP signal is a read with insert size > insert size max or whose two paired reads fall on two different chromosomes; an SR signal is a soft-clipped read; an SU signal is a read from a pair in which only one read matches the reference sequence.
  • DP signal clustering analysis step: cluster the DP signals obtained in the signal classification step, taking reads whose distance is within the insert size max range and whose direction is the same as one DP signal cluster; each cluster serves as a structural variation candidate.
  • Fusion breakpoint analysis step: extract the SR and SU signals within the insert size max range of each cluster, add the corresponding DP signals for assembly, and re-align the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences.
  • SR signal analysis step: search the SR signals obtained in the signal classification step for chimeric alignments to find variants that contain no DP signal; within the insert size range on both sides of the SR signal interval, extract the corresponding DP and SU signals, add the reference sequence corresponding to the SR signal interval plus at least 10 bp on both sides for assembly, and re-align the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences.
  • Calculation and annotation step: calculate the mutation depth on the left and right sides of each fusion breakpoint and identify the structural variation type; each result is annotated from the left and right breakpoints left_bp and right_bp and the alignment directions of the left and right assembled fragments.
  • Annotation result merging and output step: merge the annotation results of the calculation and annotation step so that overlapping records produced by dual recognition through the DP and SR signals are combined, and take the merged result as the structural variation detection result of the subject under test.
  • The method and device for structural variation detection in this application offer high accuracy, high efficiency and a wide identification range. The following key points contribute most to these effects.
  • The first is clustering the DP signals after the abnormal signals have been extracted.
  • The second is assembling and re-aligning the clustering results.
  • The third is SR retrieval. This step focuses on SR signals with chimeric alignments and ensures that variants with weak DP signals, and certain special variants with no DP signal at all, are still captured accurately and assembled locally.
  • The last is microhomology sequence recognition in the annotation stage, including identification of small-fragment insertions and short tandem repeats; even where the breakpoint is ambiguous, the most likely fusion breakpoint is reported together with the base sequence that causes the ambiguity.
  • The parallel design used at several points in this application is another highlight, especially in the steps that process larger amounts of data; it preserves accuracy while keeping the run efficient. The key points of the identification method are elaborated below.
  • The DP signal in this application is defined as a read pair whose insert size is not less than the maximum insert size, or whose reads align to two different chromosomes. Setting aside alignment errors and other extreme factors, structural variation must occur in the regions where DP signals aggregate. Most traditional clustering methods are density based. Their advantage is that they find the most enriched regions; their obvious disadvantage is that they easily miss key signals, and density clustering often performs poorly when the sequencing depth is low. This application clusters DP signals with a breadth-based strategy, which keeps as many useful DP signals as possible inside the clusters. The subsequent local assembly and re-alignment minimize the impact of false-positive signals within clusters.
  • Clustering first groups the left-hand reads by distance. Reads are read one by one, in parallel per chromosome, from the temporary BAM that stores the DP signals, and are assigned to clusters by distance: if the distance from the next read to the boundary of an existing cluster is less than insert size max, the read is added to that cluster.
  • The corresponding right-hand reads in each cluster are then clustered in the same way. If the left-hand cluster is split into several clusters by its right-hand reads, each is recorded as a separate clustering result (clique), as shown in Figure 3. Figure 3 shows a case where the left-hand reads form one cluster while the right-hand reads split into two; the method records them as clique1 and clique2.
  • Assembly may produce many results, of which only a few reflect the true sequence, so they must be filtered; re-alignment serves as that filter.
  • The method of this application re-aligns the assembly results back to the corresponding reference sequence and takes the result that contains the breakpoint and any short template insertion sequence as the most likely structural variation result.
  • The assembly process can identify short microhomology sequences near fusion breakpoints and novel insertions within the insert size range. This step yields the precise fusion breakpoints on both sides of the structural variation and the base sequences on both sides of the breakpoints.
  • The SR retrieval strategy of this method is another highlight. Because the DP signal is defined as a read pair with a larger insert size or with reads on different chromosomes, some small or special structural variations may still be missed during clustering, for example because of depth. The SR retrieval strategy is an additional treatment for this problem.
  • The SR retrieval step first processes the two alignment positions of each SA signal among the SR signals separately. On the left and right sides relative to the reference sequence, it searches the nearby region for DP and SU signals that may exist, cuts out the reference sequence of the corresponding region, assembles these signals together, and takes the most likely assembly result as the fusion breakpoint result. This process is similar to the DP signal clustering and assembly process.
  • SR retrieval supplements the DP-based detection process; it ensures that certain special structural variations are detected and raises the detection rate and accuracy of the overall result.
  • This application performs very well when detecting larger structural variations in hotspot regions, including interchromosomal translocations.
  • bam format file: the binary form of a SAM format file. A SAM file is a fixed-format file that represents alignment results and is generally produced by aligning sequencing data to a reference sequence.
  • DP signal: Discordant Pair. In next-generation paired-end sequencing, a pair with insert size > insert size max, or whose two reads align far apart or on different chromosomes.
  • SR signal: Split Reads. A read that has been clipped: one read is divided into two parts that align to different positions, that is, a soft-clipped read.
  • insert size: the size of the fragment in paired-end sequencing.
  • The structural variation detection method in this example is as follows:
  • Step 1, data acquisition: obtain the bam file of the next-generation sequencing data of the subject under test and calculate its basic information: insert size mean and standard deviation, insert size max (insert size mean + 3.96 × insert size std), and read length;
  • Step 2, signal classification: extract the reads in parallel from the bam file in intervals of 75k, and divide the abnormal reads into three signal types: DP (insert size > insert size max, or the two paired reads fall on two different chromosomes), SR (soft-clipped reads) and SU (only one read of the pair matches the reference sequence); the extracted signals are placed in temporary files;
  • Step 3, DP signal clustering analysis: cluster the DP signals extracted in step 2 and find DP signal clusters (cliques) with similar positions and the same direction; each cluster serves as a structural variation candidate. Similar position means the distance is within the insert size max range;
  • Step 4, fusion breakpoint analysis: extract the SR and SU signals near each clique of the clustering result from step 3, add the DP signals for assembly, and re-align the assembly results to find fusion breakpoints, microhomology sequences and other short template insertions. Near each clique means within the insert size max range of each cluster;
  • Step 5, SR signal analysis: search the SR signals extracted in step 2 for SA signals (chimeric alignments) to find variants that contain no DP signal, extract the corresponding DP and SU signals near the region where the SR signal occurs, add the reference sequence near that position for assembly, and re-align the assembly results to find fusion breakpoints and possible microhomology and short template insertion sequences. Specifically, the corresponding DP and SU signals are extracted within the insert size range on both sides of the SR signal interval, and the reference sequence corresponding to the SR signal interval plus at least 10 bp on both sides is added for assembly. Variants that contain no DP signal mainly include variants that lack a DP signal because the sequence is short, and long-sequence variants that lack a DP signal for other special reasons;
  • Step 6, calculation and annotation: take each result of step 4 and step 5 as a structural variation candidate, calculate the mutation depth on the left and right sides of the fusion breakpoint, identify the structural variation type, and annotate each result from the left and right breakpoints left_bp and right_bp and the alignment directions of the left and right assembled fragments;
  • Step 7, annotation result merging and output: merge the annotation results so that overlapping records produced by dual recognition through DP and SR are combined, and output the final structural variation results.
  • The left and right sides of the fusion breakpoint refer to the left side of the left breakpoint and the right side of the right breakpoint.
  • The numbers of DP, SR and SU signals contained in the consensus sequences on the left and right sides are each taken as an alt depth.
  • The larger of the two alt depths is used as the mutation depth, and the number of DP signals, SR signals, SU signals and normal reads in the corresponding interval is used as the overall depth.
  • Annotation identifies the structural variation type from the two alignment directions and the relative positions of breakpoint 1 and breakpoint 2. If the left and right breakpoints are not on the same chromosome, the variant is an interchromosomal translocation: type 2 if the left and right sequence directions are consistent, type 1 if they are not. If the left and right breakpoints are on the same chromosome and the left and right alignment directions are consistent, it is a chromosomal inversion. If breakpoint 1 lies before breakpoint 2 and breakpoint 1 is a reverse alignment, or breakpoint 1 lies after breakpoint 2 and breakpoint 2 is a reverse alignment, it is a chromosomal deletion. Everything else is a chromosomal duplication.
  • the structural variation detection method in this example can identify a variety of structural variation types including deletions, inversions, duplications, intrachromosomal translocations, interchromosomal translocations, etc., and provide the output of microhomology sequences and short template sequences near breakpoints.
  • Based on this method, this example further developed the corresponding software ncsv2 as the structural variation detection device of this example.
  • The device takes as input the sample's sorted BAM file, the alignment information, and the hotregion file describing the hotspot regions of the sample's sequencing chip; it directly produces all structural variation information of the sample and stores it in the resulting csv file.
  • Each variant record contains the variant type, the positions of the two breakpoints, the type and number of genes on both sides of the breakpoints, the mutation frequency, the numbers of DP, SR and SU reads supporting the variant, and a link to the IGV plot of the variant.
  • The structural variation detection device in this example can efficiently and accurately detect fusion breakpoints and the base sequences on both sides of the breakpoints, can identify a variety of structural variation types such as deletions, inversions, duplications, intrachromosomal translocations and interchromosomal translocations, and outputs microhomology sequences and short template sequences near breakpoints.

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Signal Processing (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

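This application discloses a method, device and storage medium for structural variation detection. The method includes obtaining an alignment file, extracting the reads within intervals of a set length from the alignment file, and dividing the abnormal reads into DP signals, SR signals and SU signals; clustering the DP signals, taking each cluster as a structural variation candidate, and performing local assembly and re-alignment for each cluster; searching the SR signals for chimeric alignments and performing assembly and re-alignment; and, for the two sets of re-alignment results, calculating the mutation depth on the left and right sides of each fusion breakpoint and identifying the structural variation type. The method uses DP signal clustering together with assembly and re-alignment to reduce false-positive signals within clusters, and supplements this with SR signal analysis so that the overall result reaches a higher detection rate and accuracy. The method can identify structural variations such as deletions, inversions, duplications, intrachromosomal translocations and interchromosomal translocations, and outputs microhomology sequences and short template sequences near the breakpoints.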

Description

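Method, device and storage medium for structural variation detection
Technical field
This application relates to the technical field of bioinformatics, and in particular to a method, device and storage medium for structural variation detection.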
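Background
Structural variation (SV) includes deletions, insertions, inversions, duplications and translocations within the genome, as well as complex structural variations composed of these simple types. After more than a decade of development, research on SV detection methods based on next-generation sequencing (NGS) data has matured considerably, but some problems remain unsolved, including precise breakpoint determination and the identification of larger and interchromosomal structural variations. With the rapid development of bioinformatics in recent years, many detection methods targeting these problems have been proposed, for example switching to third-generation long-read methods, using other identification logic, or using more accurate sequencing data. What these methods share is that they approach the problem from another angle to avoid the limitations of short reads and short inserts; they do not actually solve the problems of SV detection based on NGS data. As a result, among traditional NGS-based detection algorithms there is still no method with broad applicability.
Cancers have long been a problem that medicine struggles to overcome. The development of bioinformatics in recent years has made it possible to understand the sequences and mechanisms of each cancer type at the gene level. Cancer is generally accompanied by changes in gene sequence, so precise identification of structural variation, especially large structural variations and variations in highly repetitive regions, is an important foundation for tackling cancer. Although many detection methods exist, their strategies are not sensitive enough for larger structural variations, and the inherent characteristics of NGS make these variations hard to identify.
Advances in sequencing technology have greatly promoted detection methods, but some problems still lack a good solution. For example, N bases in the sequencing results, sequencing errors and highly repetitive regions all make structural variation detection much harder. NGS-based detection is limited first by the read length and second by the template length; many methods are therefore restricted to variations within the template length, often within a few hundred bp, while larger variations require resource-intensive strategies such as de novo assembly, and the diversity of assembly results makes it hard for these methods to determine the original sequence. Another problem that is hard to solve is sequencing depth: whole-genome sequencing is generally within 100×, and such depth cannot guarantee the accuracy of clustering-based methods.
NGS-based structural variation detection generally relies on the following strategies: methods based on paired-end mapping information (Paired End Mapping, PEM, also called DP), methods based on split reads (Split Read, SR), methods based on depth (Depth of Coverage, DoC), and methods based on assembly (Assembly). Most current mainstream detection methods use one of these strategies or a combination of several. For example, many methods cluster SR signals when detecting fusion breakpoints; BreakSeek, an indel breakpoint detection algorithm based on a Bayesian model, needs long iteration times at high depth and loses considerable accuracy at low depth.
The biggest weakness of traditional NGS-based structural variation detection methods is their poor performance on large and very large structural variations. Most methods can only detect structural variations within a few thousand bp, and their ability degrades for variations larger than the insert size. For example, inGap-sv, a detection method based on depth differences, identifies structural variations from the numbers of DP, SR, SU and normal read pairs and corrects the results with depth information, but cannot identify more complex or cross-chromosome structural variations; assembly-based methods such as manta and SV-aba are hard to apply in highly repetitive regions and are time-consuming; classic methods such as Pindel and Delly work well for small indels but perform poorly once the structural variation exceeds the template fragment length. Another difficulty for traditional methods is that obtaining a reasonably precise fusion breakpoint generally requires clustering or local assembly, which is where discrepancies tend to arise.
NGS technology currently dominates the market and can be expected to keep doing so for a long time. How to solve the difficulty of precise breakpoint detection and of identifying larger and interchromosomal structural variations from NGS data therefore remains a focus and a difficulty of research in this field.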
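Summary of the invention
The purpose of this application is to provide a new method, device and storage medium for structural variation detection.
To achieve this purpose, this application adopts the following technical solutions:
A first aspect of this application discloses a method for structural variation detection, comprising the following steps:
a data acquisition step, including obtaining the alignment file of the next-generation sequencing data of the subject under test and its basic information, the basic information including the insert size mean and standard deviation, insert size max, and read length;
a signal classification step, including extracting the reads within intervals of a set length from the alignment file and dividing the abnormal reads into DP signals, SR signals and SU signals; a DP signal is a read with insert size > insert size max or whose two paired reads fall on two different chromosomes, an SR signal is a soft-clipped read, and an SU signal is a read from a pair in which only one read matches the reference sequence;
a DP signal clustering analysis step, including clustering the DP signals obtained in the signal classification step and taking reads that are close in position and in the same direction as one DP signal cluster, each cluster serving as a structural variation candidate; close in position means the distance is within the insert size max range, that is, within the normal insert size range;
a fusion breakpoint analysis step, including extracting the SR and SU signals within the insert size max range of each cluster obtained in the DP signal clustering analysis step, adding the corresponding DP signals for assembly, and re-aligning the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences;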
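here, SR and SU signals are extracted within the insert size max range of each cluster mainly for the following reason. The DP clustering result records preliminary breakpoint information for each cluster: if the cluster begins or ends with an SR signal, that position is set as the left or right breakpoint; if there is no SR signal, the start and end positions of the DP cluster are used as the left and right breakpoints. When SR and SU signals are extracted, if a breakpoint of the DP cluster was determined from the left or right end of the cluster, the flank in that direction is insert size max - 2 × read length, which makes it as likely as possible that the SR and SU signals are fetched without fetching redundant signals; if the breakpoint on a given side was determined from an SR signal, the flank is set to 10 bp, because SR signals shorter than 5 bp are regarded as unreliable during SR filtering;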
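an SR signal analysis step, including searching the SR signals obtained in the signal classification step for chimeric alignments (SA signals) to find variants that contain no DP signal, extracting the corresponding DP and SU signals near the region where the variant occurs, adding the reference sequence near that region for assembly, and re-aligning the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences. Finding variants that contain no DP signal has two meanings: first, finding variants that lack a DP signal because the sequence is short; second, finding other special variants that lack a DP signal. Some short-sequence variants do contain DP signals, and long-sequence variants do not always contain them; some variants really are special. This application therefore uses SR signal retrieval to find smaller and special structural variations;
here, a short sequence means that the variant sequence is short. An insert size within the insert size mean + 3.96 × the insert size standard deviation is generally regarded as normal, and read pairs with a normal insert size may still carry variant signals, for example a variant about one read length long. Such a variant may not produce enough DP signals. SR signal analysis makes up for the shortcomings of DP-only detection and finds small SVs within the insert size max range;
near the region where the variant occurs means that, once the SR signal has been determined, it defines an interval, and signals are fetched within flank = insert size on both sides of that interval; the flank is set this way to find enough abnormal signals, so that a usable consensus sequence is more likely to be assembled on both the left and right sides;
adding the reference sequence near the region for assembly means that, once the SR signal has been determined, it defines an interval, and the reference sequence is fetched within flank = 10 bp on both sides of that interval to raise the success rate of assembly; that is, the reference sequence corresponding to the SR signal interval plus at least 10 bp on both sides is added for assembly;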
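a calculation and annotation step, including, for the results of the fusion breakpoint analysis step and the SR signal analysis step, calculating the mutation depth on the left and right sides of each fusion breakpoint and identifying the structural variation type, each result being annotated from the left and right breakpoints left_bp and right_bp and the alignment directions of the left and right assembled fragments;
an annotation result merging and output step, including merging the annotation results of the calculation and annotation step so that overlapping records produced by dual recognition through the DP and SR signals are combined, and taking the merged result as the structural variation detection result of the subject under test.
It should be noted that, after the abnormal signals have been extracted, the structural variation detection method of this application clusters the DP signals and assembles and re-aligns the clustering results. Then, in the SR retrieval part, it focuses on SR signals with chimeric alignments, which ensures that variants with weak DP signals, and certain special variants with no DP signal at all, are still captured accurately and assembled locally. Finally, microhomology sequence recognition in the annotation part covers small-fragment insertions and short tandem repeats; even where the breakpoint is ambiguous, the most likely fusion breakpoint is reported together with the base sequence that causes the ambiguity. In addition, the parallel design used at several points is a highlight, especially in the steps that process larger amounts of data; it preserves accuracy while keeping the run efficient. The method identifies structural variations with high accuracy, high efficiency and a wide identification range, and provides a new approach to structural variation detection.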
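In one implementation of this application, the alignment file in the data acquisition step is a bam file.
Preferably, insert size max is the insert size mean + 3.96 × the insert size standard deviation.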
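The threshold above is a one-line calculation once the insert sizes of properly aligned pairs have been collected. The following is a minimal sketch, not the application's implementation: the function name `insert_size_max`, the sample values and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics

def insert_size_max(insert_sizes, k=3.96):
    """Return (mean, std, insert_size_max) for a sample of insert sizes.

    k=3.96 is the multiplier stated in the text; using the population
    standard deviation (pstdev) is an assumption of this sketch.
    """
    mean = statistics.fmean(insert_sizes)
    std = statistics.pstdev(insert_sizes)
    return mean, std, mean + k * std

# Hypothetical insert sizes (bp) from a handful of read pairs.
sizes = [300, 310, 290, 305, 295]
mean, std, size_max = insert_size_max(sizes)
print(f"mean={mean:.1f} std={std:.2f} insert_size_max={size_max:.1f}")
```

Read pairs whose insert size exceeds `size_max` would then be candidates for the DP signal.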
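In one implementation of this application, the set length in the signal classification step is 75k.
It should be noted that the set length of 75k in the signal classification step is not fixed. In practice, this application found that dividing chromosome regions into blocks of 75 kbp lets the parallel modules make full use of the computer's resources. The set length is the parallel processing interval used when the three signal types are extracted separately; it can be set as required, and 75 kbp is the recommended default.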
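Splitting each chromosome into fixed-length blocks for parallel extraction can be sketched as follows. This is an illustrative sketch only: `make_windows`, the half-open coordinate convention and the chromosome lengths are assumptions, and the parallel dispatch itself is left out.

```python
def make_windows(chrom_lengths, window=75_000):
    """Yield (chrom, start, end) half-open windows covering each chromosome.

    window defaults to the 75 kbp block size recommended in the text.
    """
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, window):
            yield chrom, start, min(start + window, length)

# Hypothetical chromosome lengths, kept small for readability.
windows = list(make_windows({"chr1": 200_000, "chr2": 75_000}))
for w in windows:
    print(w)
```

Each window could then be handed to a worker (for example through `concurrent.futures`) that extracts the abnormal reads falling inside it.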
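In one implementation of this application, in the calculation and annotation step, the left and right sides of the fusion breakpoint refer to the left side of the left breakpoint and the right side of the right breakpoint. The numbers of DP, SR and SU signals contained in the consensus sequences on the left and right sides are each taken as an alt depth, the larger of the two is taken as the mutation depth, and the number of DP signals, SR signals, SU signals and normal reads in the corresponding interval is taken as the overall depth.
It will be understood that the left and right sides of the fusion breakpoint are likewise not a specific range of values. They refer to the left side of the left breakpoint and the right side of the right breakpoint, and the lengths of these two regions are not fixed, because they depend on the length of the assembled consensus. In one implementation of this application, the number of DP + SR + SU signals contained in each of the left and right consensus sequences is taken as an alt depth, the larger of the two is taken as the mutation depth, and the number of DP + SR + SU + normal reads in the corresponding interval is taken as the overall depth; the mutation frequency can then be calculated.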
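The depth calculation described above reduces to a maximum and a ratio. The sketch below assumes that the mutation frequency is the mutation depth divided by the overall depth, which the text implies but does not spell out; the function name and the counts are hypothetical.

```python
def mutation_frequency(left_alt, right_alt, total_depth):
    """Return (mutation_depth, frequency).

    left_alt / right_alt: DP + SR + SU counts in the left / right consensus.
    total_depth: DP + SR + SU + normal reads in the corresponding interval.
    frequency = mutation_depth / total_depth is an assumption of this sketch.
    """
    mutation_depth = max(left_alt, right_alt)
    frequency = mutation_depth / total_depth if total_depth else 0.0
    return mutation_depth, frequency

depth, freq = mutation_frequency(left_alt=12, right_alt=20, total_depth=400)
print(depth, freq)
```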
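In one implementation of this application, annotating each result in the calculation and annotation step specifically includes identifying the structural variation type from the two alignment directions and the relative positions of breakpoint 1 and breakpoint 2. If the left and right breakpoints are not on the same chromosome, the variant is an interchromosomal translocation: type 2 if the left and right sequence directions are consistent, type 1 if they are not. If the left and right breakpoints are on the same chromosome and the left and right alignment directions are consistent, it is a chromosomal inversion. If breakpoint 1 lies before breakpoint 2 and breakpoint 1 is a reverse alignment, or breakpoint 1 lies after breakpoint 2 and breakpoint 2 is a reverse alignment, it is a chromosomal deletion. Everything else is a chromosomal duplication.
It should be noted that, when breakpoints are ordered in this application, the left and right breakpoints are determined from their relative positions. Breakpoint 1 is therefore always to the left of breakpoint 2, that is, the smaller of the two; for different chromosomes, the one with the smaller chromosome number comes first and the larger one second. In other words, breakpoint 1 is the left breakpoint and breakpoint 2 is the right breakpoint.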
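The decision rules above map directly onto a small function. This is a sketch of the rules as stated, not the application's code: the `Breakpoint` record, its `reverse` flag (True when the assembled fragment on that side aligns in reverse) and the returned labels are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Breakpoint:
    chrom: str
    pos: int
    reverse: bool  # hypothetical flag: fragment on this side aligns in reverse

def classify_sv(bp1, bp2):
    """Apply the stated rules to breakpoint 1 (left) and breakpoint 2 (right)."""
    same_direction = bp1.reverse == bp2.reverse
    if bp1.chrom != bp2.chrom:
        kind = "type 2" if same_direction else "type 1"
        return f"interchromosomal translocation ({kind})"
    if same_direction:
        return "inversion"
    if (bp1.pos < bp2.pos and bp1.reverse) or (bp1.pos > bp2.pos and bp2.reverse):
        return "deletion"
    return "duplication"

print(classify_sv(Breakpoint("chr1", 100, True), Breakpoint("chr1", 500, False)))
print(classify_sv(Breakpoint("chr2", 100, False), Breakpoint("chr7", 900, False)))
```

Because breakpoint 1 is always ordered to the left of breakpoint 2, the second half of the deletion condition only matters if the function is called with unordered breakpoints.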
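A second aspect of this application discloses a device for structural variation detection, including a data acquisition module, a signal classification module, a DP signal clustering analysis module, a fusion breakpoint analysis module, an SR signal analysis module, a calculation and annotation module, and an annotation result merging and output module, as follows:
the data acquisition module is used to obtain the alignment file of the next-generation sequencing data of the subject under test and its basic information, the basic information including the insert size mean and standard deviation, insert size max, and read length;
the signal classification module is used to extract the reads within intervals of a set length from the alignment file and divide the abnormal reads into DP signals, SR signals and SU signals; a DP signal is a read with insert size > insert size max or whose two paired reads fall on two different chromosomes, an SR signal is a soft-clipped read, and an SU signal is a read from a pair in which only one read matches the reference sequence;
the DP signal clustering analysis module is used to cluster the DP signals obtained by the signal classification module, taking reads whose distance is within the insert size max range and whose direction is the same as one DP signal cluster, each cluster serving as a structural variation candidate;
the fusion breakpoint analysis module is used to extract the SR and SU signals within the insert size max range of each cluster obtained by the DP signal clustering analysis module, add the corresponding DP signals for assembly, and re-align the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences;
the SR signal analysis module is used to search the SR signals obtained by the signal classification module for chimeric alignments to find variants that contain no DP signal, extract the corresponding DP and SU signals near the region where the variant occurs, that is, within the insert size range on both sides of the SR signal interval, add the reference sequence near the region, that is, the SR signal interval plus at least 10 bp on both sides, for assembly, and re-align the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences;
the calculation and annotation module is used, for the results of the fusion breakpoint analysis module and the SR signal analysis module, to calculate the mutation depth on the left and right sides of each fusion breakpoint and identify the structural variation type, each result being annotated from the left and right breakpoints left_bp and right_bp and the alignment directions of the left and right assembled fragments;
the annotation result merging and output module is used to merge the annotation results of the calculation and annotation module so that overlapping records produced by dual recognition through the DP and SR signals are combined, and to take the merged result as the structural variation detection result of the subject under test.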
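It should be noted that the device for structural variation detection in this application implements, through its modules, the respective steps of the method for structural variation detection in this application; the specific limitations of each module can therefore be found in the description of the method and are not repeated here. For example, the alignment file and insert size max in the data acquisition module, the set length in the signal classification module, and the way each result is annotated in the calculation and annotation module can all refer to the method.
A third aspect of this application discloses a device for structural variation detection, the device including a memory and a processor; the memory is used to store a program; the processor is used to execute the program stored in the memory to implement the method for structural variation detection of this application.
A fourth aspect of this application discloses a computer-readable storage medium in which a program is stored, the program being executable by a processor to implement the method for structural variation detection of this application.
With the above technical solutions, the beneficial effects of this application are as follows:
The method and device for structural variation detection in this application use DP signal clustering combined with subsequent local assembly and re-alignment, which effectively reduces false-positive signals within clusters and yields the precise fusion breakpoints on both sides of a structural variation and the base sequences on both sides of the breakpoints; SR signal analysis then supplements the DP-based detection results so that the overall result reaches a higher detection rate and accuracy. The structural variation detection method can identify a variety of structural variation types, including deletions, inversions, duplications, intrachromosomal translocations and interchromosomal translocations, and outputs microhomology sequences and short template sequences near breakpoints.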
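Brief description of the drawings
Figure 1 is a flow chart of the structural variation detection method in an embodiment of this application;
Figure 2 is a structural block diagram of the structural variation detection device in an embodiment of this application;
Figure 3 is a schematic diagram of the left-side and right-side clustering during DP signal clustering in an embodiment of this application.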
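Detailed description
This application is described in further detail below through specific embodiments with reference to the drawings. In the following embodiments, many details are described so that this application can be better understood. However, those skilled in the art will readily recognize that some of these features can be omitted in different situations, or can be replaced by other devices, materials or methods. In some cases, certain operations related to this application are not shown or described in the specification, to avoid burying the core of this application under excessive description; for those skilled in the art, a detailed description of these operations is not necessary, and they can be fully understood from the description in the specification and general technical knowledge in the field.
The biggest weakness of traditional NGS-based structural variation detection methods is their poor performance on large and very large structural variations, together with the difficulty of precise breakpoint detection.
This application is a method for assisting the identification of cancer hotspot fusions. It accepts WES or panel data from various plasma and tissue samples, captured and sequenced with gene chips. The bam file obtained by preprocessing the raw sequencing data and the corresponding chip capture intervals are used as input, and the chip capture intervals are used for SV calling. The chip capture intervals are the hotspot mutation intervals of each cancer type; the capture depth within them can reach several thousand or even tens of thousands, which gives good sequence information for the sample within the intervals while excluding the influence of false-positive sequences during detection. Although the capture intervals are the key positions for analyzing structural variation, this application still detects and analyzes structural variation at the whole-genome level, so as not to miss potential variant signals and reads that fall outside the intervals. This application re-aligns sequences in the target intervals to find hotspot fusion breakpoints and the hotspot gene sequences on both sides of the breakpoints, while identifying microhomology sequences and short template sequence insertions on both sides of the breakpoints as fully as possible.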
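Specifically, the method for structural variation detection in this application, as shown in Figure 1, includes a data acquisition step 11, a signal classification step 12, a DP signal clustering analysis step 13, a fusion breakpoint analysis step 14, an SR signal analysis step 15, a calculation and annotation step 16, and an annotation result merging and output step 17.
The data acquisition step 11 includes obtaining the alignment file of the next-generation sequencing data of the subject under test and its basic information, the basic information including the insert size mean and standard deviation, insert size max, and read length. The alignment file is a bam file, and insert size max is the insert size mean + 3.96 × the insert size standard deviation.
The signal classification step 12 includes extracting the reads within intervals of a set length from the alignment file and dividing the abnormal reads into DP signals, SR signals and SU signals; a DP signal is a read with insert size > insert size max or whose two paired reads fall on two different chromosomes, an SR signal is a soft-clipped read, and an SU signal is a read from a pair in which only one read matches the reference sequence. For example, the reads are extracted from the bam file in parallel in intervals of 75k.
The DP signal clustering analysis step 13 includes clustering the DP signals obtained in the signal classification step, taking reads whose distance is within the insert size max range and whose direction is the same as one DP signal cluster, each cluster serving as a structural variation candidate.
The fusion breakpoint analysis step 14 includes extracting the SR and SU signals within the insert size max range of each cluster obtained in the DP signal clustering analysis step, adding the corresponding DP signals for assembly, and re-aligning the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences.
The SR signal analysis step 15 includes searching the SR signals obtained in the signal classification step for chimeric alignments to find variants that contain no DP signal, extracting the corresponding DP and SU signals near the region where the variant occurs, that is, within the insert size range on both sides of the SR signal interval, adding the reference sequence near the region, that is, the SR signal interval plus at least 10 bp on both sides, for assembly, and re-aligning the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences.
The calculation and annotation step 16 includes, for the results of the fusion breakpoint analysis step and the SR signal analysis step, calculating the mutation depth on the left and right sides of each fusion breakpoint and identifying the structural variation type, each result being annotated from the left and right breakpoints left_bp and right_bp and the alignment directions of the left and right assembled fragments.
Specifically, the structural variation type is identified from the two alignment directions and the relative positions of breakpoint 1 and breakpoint 2. If the left and right breakpoints are not on the same chromosome, the variant is an interchromosomal translocation: type 2 if the left and right sequence directions are consistent, type 1 if they are not. If the left and right breakpoints are on the same chromosome and the left and right alignment directions are consistent, it is a chromosomal inversion. If breakpoint 1 lies before breakpoint 2 and breakpoint 1 is a reverse alignment, or breakpoint 1 lies after breakpoint 2 and breakpoint 2 is a reverse alignment, it is a chromosomal deletion. Everything else is a chromosomal duplication.
The annotation result merging and output step 17 includes merging the annotation results of the calculation and annotation step so that overlapping records produced by dual recognition through the DP and SR signals are combined, and taking the merged result as the structural variation detection result of the subject under test.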
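Those skilled in the art will understand that all or part of the functions of the above method can be implemented in hardware or by a computer program. When all or part of the functions are implemented by a computer program, the program can be stored in a computer-readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disc, a hard disk and the like, and the functions are realized when a computer executes the program. For example, the program is stored in the memory of a device, and all or part of the above functions are realized when the processor executes the program in the memory. In addition, when all or part of the functions in the above embodiments are implemented by a computer program, the program can also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disc, a flash drive or a removable hard disk, and saved to the memory of a local device by downloading or copying, or used to update the system of the local device; all or part of the functions of the above method are realized when the processor executes the program in the memory.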
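Accordingly, based on the method for structural variation detection in this application, this application proposes a device for structural variation detection, as shown in Figure 2, including a data acquisition module 21, a signal classification module 22, a DP signal clustering analysis module 23, a fusion breakpoint analysis module 24, an SR signal analysis module 25, a calculation and annotation module 26, and an annotation result merging and output module 27.
The data acquisition module 21 is used to obtain the alignment file of the next-generation sequencing data of the subject under test and its basic information, the basic information including the insert size mean and standard deviation, insert size max, and read length. For example, the alignment file is a bam file, and insert size max is the insert size mean + 3.96 × the insert size standard deviation.
The signal classification module 22 is used to extract the reads within intervals of a set length from the alignment file and divide the abnormal reads into DP signals, SR signals and SU signals; a DP signal is a read with insert size > insert size max or whose two paired reads fall on two different chromosomes, an SR signal is a soft-clipped read, and an SU signal is a read from a pair in which only one read matches the reference sequence.
The DP signal clustering analysis module 23 is used to cluster the DP signals obtained by the signal classification module, taking reads whose distance is within the insert size max range and whose direction is the same as one DP signal cluster, each cluster serving as a structural variation candidate.
The fusion breakpoint analysis module 24 is used to extract the SR and SU signals within the insert size max range of each cluster obtained by the DP signal clustering analysis module, add the corresponding DP signals for assembly, and re-align the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences.
The SR signal analysis module 25 is used to search the SR signals obtained by the signal classification module for chimeric alignments to find variants that contain no DP signal, extract the corresponding DP and SU signals near the region where the variant occurs, that is, within the insert size range on both sides of the SR signal interval, add the reference sequence near the region, that is, the SR signal interval plus at least 10 bp on both sides, for assembly, and re-align the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences.
The calculation and annotation module 26 is used, for the results of the fusion breakpoint analysis module and the SR signal analysis module, to calculate the mutation depth on the left and right sides of each fusion breakpoint and identify the structural variation type, each result being annotated from the left and right breakpoints left_bp and right_bp and the alignment directions of the left and right assembled fragments.
The annotation result merging and output module 27 is used to merge the annotation results of the calculation and annotation module so that overlapping records produced by dual recognition through the DP and SR signals are combined, and to take the merged result as the structural variation detection result of the subject under test.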
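Another implementation of this application further provides a device for structural variation detection, the device including a memory and a processor; the memory is used to store a program; the processor is used to execute the program stored in the memory to implement the following method: a data acquisition step, including obtaining the alignment file of the next-generation sequencing data of the subject under test and its basic information, the basic information including the insert size mean and standard deviation, insert size max, and read length; a signal classification step, including extracting the reads within intervals of a set length from the alignment file and dividing the abnormal reads into DP signals, SR signals and SU signals, where a DP signal is a read with insert size > insert size max or whose two paired reads fall on two different chromosomes, an SR signal is a soft-clipped read, and an SU signal is a read from a pair in which only one read matches the reference sequence; a DP signal clustering analysis step, including clustering the DP signals obtained in the signal classification step, taking reads whose distance is within the insert size max range and whose direction is the same as one DP signal cluster, each cluster serving as a structural variation candidate; a fusion breakpoint analysis step, including extracting the SR and SU signals within the insert size max range of each cluster obtained in the DP signal clustering analysis step, adding the corresponding DP signals for assembly, and re-aligning the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences; an SR signal analysis step, including searching the SR signals obtained in the signal classification step for chimeric alignments to find variants that contain no DP signal, extracting the corresponding DP and SU signals near the region where the variant occurs, that is, within the insert size range on both sides of the SR signal interval, adding the reference sequence corresponding to the SR signal interval plus at least 10 bp on both sides for assembly, and re-aligning the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences; a calculation and annotation step, including, for the results of the fusion breakpoint analysis step and the SR signal analysis step, calculating the mutation depth on the left and right sides of each fusion breakpoint and identifying the structural variation type, each result being annotated from the left and right breakpoints left_bp and right_bp and the alignment directions of the left and right assembled fragments; and an annotation result merging and output step, including merging the annotation results of the calculation and annotation step so that overlapping records produced by dual recognition through the DP and SR signals are combined, and taking the merged result as the structural variation detection result of the subject under test.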
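Another implementation of this application further provides a computer-readable storage medium that includes a program, the program being executable by a processor to implement the following method: a data acquisition step, including obtaining the alignment file of the next-generation sequencing data of the subject under test and its basic information, the basic information including the insert size mean and standard deviation, insert size max, and read length; a signal classification step, including extracting the reads within intervals of a set length from the alignment file and dividing the abnormal reads into DP signals, SR signals and SU signals, where a DP signal is a read with insert size > insert size max or whose two paired reads fall on two different chromosomes, an SR signal is a soft-clipped read, and an SU signal is a read from a pair in which only one read matches the reference sequence; a DP signal clustering analysis step, including clustering the DP signals obtained in the signal classification step, taking reads whose distance is within the insert size max range and whose direction is the same as one DP signal cluster, each cluster serving as a structural variation candidate; a fusion breakpoint analysis step, including extracting the SR and SU signals within the insert size max range of each cluster obtained in the DP signal clustering analysis step, adding the corresponding DP signals for assembly, and re-aligning the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences; an SR signal analysis step, including searching the SR signals obtained in the signal classification step for chimeric alignments to find variants that contain no DP signal, extracting the corresponding DP and SU signals near the region where the variant occurs, that is, within the insert size range on both sides of the SR signal interval, adding the reference sequence corresponding to the SR signal interval plus at least 10 bp on both sides for assembly, and re-aligning the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences; a calculation and annotation step, including, for the results of the fusion breakpoint analysis step and the SR signal analysis step, calculating the mutation depth on the left and right sides of each fusion breakpoint and identifying the structural variation type, each result being annotated from the left and right breakpoints left_bp and right_bp and the alignment directions of the left and right assembled fragments; and an annotation result merging and output step, including merging the annotation results of the calculation and annotation step so that overlapping records produced by dual recognition through the DP and SR signals are combined, and taking the merged result as the structural variation detection result of the subject under test.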
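The method and device for structural variation detection in this application offer high accuracy, high efficiency and a wide identification range, and the following key points contribute most to these effects. The first is clustering the DP signals after the abnormal signals have been extracted. The second is assembling and re-aligning the clustering results. The third is SR retrieval, which focuses on SR signals with chimeric alignments and ensures that variants with weak DP signals, and certain special variants with no DP signal at all, are still captured accurately and assembled locally. The last is microhomology sequence recognition in the annotation part, including identification of small-fragment insertions and short tandem repeats; even where the breakpoint is ambiguous, the most likely fusion breakpoint is reported together with the base sequence that causes the ambiguity. The parallel design used at several points is also a highlight, especially in the steps that process larger amounts of data; it preserves accuracy while keeping the run efficient. The key points of the identification method are elaborated below.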
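(1) DP clustering to find candidate regions of structural variation
The DP signal in this application is defined as a read pair whose insert size is not less than the maximum insert size, or whose reads align to two different chromosomes. Setting aside alignment errors and other extreme factors, structural variation must occur in the regions where DP signals aggregate. Most traditional clustering methods are density based. Their advantage is that they find the most enriched regions; their obvious disadvantage is that they easily miss key signals, and density clustering often performs poorly when the sequencing depth is low. This application clusters DP signals with a breadth-based strategy, which keeps as many useful DP signals as possible inside the clusters, and the subsequent local assembly and re-alignment minimize the impact of false-positive signals within clusters.
Clustering first groups the left-hand reads by distance, as follows: reads are read one by one, in parallel per chromosome, from the temporary BAM that stores the DP signals, and are assigned to clusters by distance; if the distance from the next read to the boundary of an existing cluster is less than insert size max, the read is added to that cluster. The corresponding right-hand reads in each cluster are then clustered in the same way. If the left-hand cluster is split into several clusters by its right-hand reads, each is recorded as a separate clustering result (clique), as shown in Figure 3. Figure 3 shows a case where the left-hand reads form one cluster while the right-hand reads split into two; the method records them as clique1 and clique2.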
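The two-pass clustering described above can be sketched with plain coordinates. This is a minimal illustration under stated assumptions, not the application's code: each discordant pair is reduced to a hypothetical `(left_pos, right_pos)` tuple, all pairs are assumed to come from the same chromosome pair and orientation, and the cluster boundary is taken to be the position of the last read added.

```python
def cluster_by_distance(items, key, max_gap):
    """Greedy 1-D clustering: sort by key and start a new cluster whenever
    the distance to the current cluster boundary is not below max_gap."""
    clusters = []
    boundary = None
    for item in sorted(items, key=key):
        pos = key(item)
        if clusters and pos - boundary < max_gap:
            clusters[-1].append(item)
        else:
            clusters.append([item])
        boundary = pos
    return clusters

def find_cliques(pairs, insert_size_max):
    """Cluster left-hand positions first, then split each left cluster
    by its right-hand positions; every resulting group is one clique."""
    cliques = []
    for left_cluster in cluster_by_distance(pairs, lambda p: p[0], insert_size_max):
        cliques.extend(cluster_by_distance(left_cluster, lambda p: p[1], insert_size_max))
    return cliques

# Hypothetical discordant pairs: one left cluster whose mates split in two,
# plus an isolated pair further along.
pairs = [(100, 5000), (150, 5100), (200, 9000), (260, 9050), (5000, 20000)]
for clique in find_cliques(pairs, insert_size_max=500):
    print(clique)
```

The first four pairs form a single left-hand cluster that splits into two cliques on the right-hand side, which corresponds to the clique1 and clique2 situation described for Figure 3.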
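(2) DP assembly and re-alignment to determine precise breakpoints and identify microhomology sequences
After clustering, for each candidate region, that is, the region corresponding to each clustering result (clique), SR and SU signals are first searched for within the region and passed to the assembly software SGA for assembly. There may be many assembly results, of which only a few reflect the true sequence, so they must be filtered; re-alignment serves as that filter. The method re-aligns the assembly results back to the corresponding reference sequence and takes the result that contains the breakpoint and any short template insertion sequence as the most likely structural variation result. The assembly process can identify short microhomology sequences near fusion breakpoints and novel insertions within the insert size range. This step yields the precise fusion breakpoints on both sides of the structural variation and the base sequences on both sides of the breakpoints.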
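The summary of the invention states how far around a clique the SR and SU signals are fetched: insert size max - 2 × read length on a side whose breakpoint came from the cluster end, and 10 bp on a side whose breakpoint came from an SR signal. A minimal sketch of that rule follows; the function names, and the assumption that the fetch interval simply extends outward from each preliminary breakpoint, are illustrative and not taken from the source.

```python
def flank_for_side(determined_by_sr, insert_size_max, read_length, sr_flank=10):
    """Flank (bp) on one side of a clique, following the stated rule."""
    if determined_by_sr:
        return sr_flank
    return max(0, insert_size_max - 2 * read_length)

def fetch_interval(left_bp, right_bp, left_by_sr, right_by_sr,
                   insert_size_max, read_length):
    """Interval searched for SR and SU signals (outward extension is an
    assumption of this sketch)."""
    left_flank = flank_for_side(left_by_sr, insert_size_max, read_length)
    right_flank = flank_for_side(right_by_sr, insert_size_max, read_length)
    return max(0, left_bp - left_flank), right_bp + right_flank

# Left breakpoint fixed by an SR signal, right breakpoint by the cluster end.
print(fetch_interval(1000, 2000, True, False, insert_size_max=500, read_length=150))
```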
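(3) SR signal retrieval to find smaller and special structural variations
The SR retrieval strategy of this method is another highlight. Because the DP signal is defined as a read pair with a larger insert size or with reads on different chromosomes, some small or special structural variations may still be missed during clustering, for example because of depth; the SR retrieval strategy is a supplementary treatment for this problem. The SR retrieval step first processes the two alignment positions of each SA signal among the SR signals separately. On the left and right sides relative to the reference sequence, it searches the nearby region for DP and SU signals that may exist, cuts out the reference sequence of the corresponding region, assembles these signals together, and takes the most likely assembly result as the fusion breakpoint result; this process is similar to the DP signal clustering and assembly process. SR retrieval supplements the DP-based detection process; it ensures that certain special structural variations are detected and raises the detection rate and accuracy of the overall result.
This application performs very well when detecting larger structural variations in hotspot regions, including interchromosomal translocations.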
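The following are some of the terms used in this application and their definitions:
SV: Structural Variation.
bam format file: the binary form of a SAM format file. A SAM file is a fixed-format file that represents alignment results and is generally produced by aligning sequencing data to a reference sequence.
DP signal: Discordant Pair. In next-generation paired-end sequencing, a pair with insert size > insert size max, or whose two reads align far apart or on different chromosomes.
SR signal: Split Reads. A read that has been clipped: one read is divided into two parts that align to different positions, that is, a soft-clipped read.
SU signal: of the two reads in paired-end sequencing, only one aligns to the reference genome.
insert size: the size of the sheared fragment in paired-end sequencing.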
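The DP, SR and SU definitions above can be expressed as a small classifier over per-read information. This is a sketch under stated assumptions: `ReadInfo` is a hypothetical record standing in for the fields an alignment library would expose, and a read is allowed to carry more than one label, which the source does not rule out.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class ReadInfo:
    chrom: str
    mate_chrom: Optional[str]  # None when the mate is not aligned
    insert_size: int           # template length; sign ignored
    soft_clipped: bool

def classify_read(read: ReadInfo, insert_size_max: float) -> List[str]:
    """Return the abnormal-signal labels for one read (empty if normal)."""
    signals = []
    if read.mate_chrom is None:
        signals.append("SU")   # only one read of the pair aligns
    elif read.mate_chrom != read.chrom or abs(read.insert_size) > insert_size_max:
        signals.append("DP")   # different chromosomes or oversized insert
    if read.soft_clipped:
        signals.append("SR")   # soft-clipped read
    return signals

print(classify_read(ReadInfo("chr1", "chr5", 0, False), 500))
print(classify_read(ReadInfo("chr1", "chr1", 320, True), 500))
```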
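Example
The structural variation detection method in this example is as follows:
Input: preprocessed bam file, reference sequence;
1. Data acquisition step: obtain the bam file of the next-generation sequencing data of the subject under test and calculate the basic information of the bam file: insert size mean and standard deviation, insert size max (insert size mean + 3.96 × insert size std), and read length;
2. Signal classification step: extract the reads in parallel from the bam file in intervals of 75k, and divide the abnormal reads into three signal types: DP (insert size > insert size max, or the two paired reads fall on two different chromosomes), SR (soft-clipped reads) and SU (only one read of the pair matches the reference sequence); the extracted signals are placed in temporary files;
3. DP signal clustering analysis step: cluster the DP signals extracted in step 2 and find DP signal clusters (cliques) with similar positions and the same direction, each cluster serving as a structural variation candidate; similar position means the distance is within the insert size max range;
4. Fusion breakpoint analysis step: extract the SR and SU signals near each clique of the clustering result from step 3, add the DP signals for assembly, and re-align the assembly results to find fusion breakpoints, microhomology sequences and other short template insertions; near each clique means within the insert size max range of each cluster;
5. SR signal analysis step: search the SR signals extracted in step 2 for SA signals (chimeric alignments) to find variants that contain no DP signal, extract the corresponding DP and SU signals near the region where the SR signal occurs, add the reference sequence near that position for assembly, and re-align the assembly results to find fusion breakpoints and possible microhomology and short template insertion sequences; specifically, the corresponding DP and SU signals are extracted within the insert size range on both sides of the SR signal interval, and the reference sequence corresponding to the SR signal interval plus at least 10 bp on both sides is added for assembly; variants that contain no DP signal mainly include variants that lack a DP signal because the sequence is short, and long-sequence variants that lack a DP signal for other special reasons;
6. Calculation and annotation step: take each result of step 4 and step 5 as a structural variation candidate, calculate the mutation depth on the left and right sides of the fusion breakpoint, identify the structural variation type, and annotate each result from the left and right breakpoints left_bp and right_bp and the alignment directions of the left and right assembled fragments;
7. Annotation result merging and output step: merge the annotation results so that overlapping records produced by dual recognition through DP and SR are combined, and output the final structural variation results.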
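Here, the left and right sides of the fusion breakpoint refer to the left side of the left breakpoint and the right side of the right breakpoint. The numbers of DP, SR and SU signals contained in the consensus sequences on the left and right sides are each taken as an alt depth, the larger of the two is taken as the mutation depth, and the number of DP signals, SR signals, SU signals and normal reads in the corresponding interval is taken as the overall depth.
Annotation specifically includes identifying the structural variation type from the two alignment directions and the relative positions of breakpoint 1 and breakpoint 2. If the left and right breakpoints are not on the same chromosome, the variant is an interchromosomal translocation: type 2 if the left and right sequence directions are consistent, type 1 if they are not. If the left and right breakpoints are on the same chromosome and the left and right alignment directions are consistent, it is a chromosomal inversion. If breakpoint 1 lies before breakpoint 2 and breakpoint 1 is a reverse alignment, or breakpoint 1 lies after breakpoint 2 and breakpoint 2 is a reverse alignment, it is a chromosomal deletion. Everything else is a chromosomal duplication.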
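Following the above method, this example performed structural variation detection on 1729 positive SV cases from two batches of panel samples; all sequencing data and samples were provided by 北京吉因加医学检验实验室有限公司. The results show a detection rate of 99.595% for the method of this example. Specifically, the small sample set contained 340 samples, in which 484 positive SVs were confirmed after interpretation and review; the supplementary validation data set contained 1091 samples, in which 1245 positive SVs were confirmed. Across the two batches, only 7 cases were either not detected or detected with a relatively large breakpoint difference (within 200 bp), 2 in the first batch and 5 in the second; all remaining cases were detected precisely, with higher breakpoint accuracy than the previously used detection software. Some samples contain more than one positive SV, which is why the number of confirmed positive SVs exceeds the number of samples.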
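The reported detection rate follows from the counts given in the paragraph above. The short check below only restates that arithmetic; the variable names are chosen for illustration.

```python
# Confirmed positive SVs in the two batches, as reported in the text.
confirmed = 484 + 1245
# Cases not detected or detected with a large breakpoint difference.
missed = 2 + 5

rate = (confirmed - missed) / confirmed
print(confirmed, f"{rate:.3%}")
```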
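The structural variation detection method in this example can identify a variety of structural variation types, including deletions, inversions, duplications, intrachromosomal translocations and interchromosomal translocations, and outputs microhomology sequences and short template sequences near breakpoints.
Based on this method, this example further developed the corresponding software ncsv2 as the structural variation detection device of this example. The device takes as input the sample's sorted BAM file, the alignment information, and the hotregion file describing the hotspot regions of the sample's sequencing chip; it directly produces all structural variation information of the sample and stores it in the resulting csv file. Each variant record contains the variant type, the positions of the two breakpoints, the type and number of genes on both sides of the breakpoints, the mutation frequency, the numbers of DP, SR and SU reads supporting the variant, and a link to the IGV plot of the variant. The device can efficiently and accurately detect fusion breakpoints and the base sequences on both sides of the breakpoints, can identify a variety of structural variation types such as deletions, inversions, duplications, intrachromosomal translocations and interchromosomal translocations, and outputs microhomology sequences and short template sequences near breakpoints.
The above is a further detailed description of this application in combination with specific embodiments, and the specific implementation of this application should not be regarded as limited to these descriptions. A person of ordinary skill in the technical field of this application can make a number of simple deductions or substitutions without departing from the concept of this application.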

Claims (10)

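  1. A method for structural variation detection, characterized by comprising the following steps:
    a data acquisition step, including obtaining the alignment file of the next-generation sequencing data of the subject under test and its basic information, the basic information including the insert size mean and standard deviation, insert size max, and read length;
    a signal classification step, including extracting the reads within intervals of a set length from the alignment file and dividing the abnormal reads into DP signals, SR signals and SU signals; the DP signal is a read with insert size > insert size max or whose two paired reads fall on two different chromosomes, the SR signal is a soft-clipped read, and the SU signal is a read from a pair in which only one read matches the reference sequence;
    a DP signal clustering analysis step, including clustering the DP signals obtained in the signal classification step, taking reads whose distance is within the insert size max range and whose direction is the same as one DP signal cluster, each cluster serving as a structural variation candidate;
    a fusion breakpoint analysis step, including extracting the SR and SU signals within the insert size max range of each cluster obtained in the DP signal clustering analysis step, adding the corresponding DP signals for assembly, and re-aligning the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences;
    an SR signal analysis step, including searching the SR signals obtained in the signal classification step for chimeric alignments to find variants that contain no DP signal, extracting the corresponding DP and SU signals near the region where the variant occurs, that is, within the insert size range on both sides of the SR signal interval, adding the reference sequence corresponding to the SR signal interval plus at least 10 bp on both sides for assembly, and re-aligning the assembly results to obtain fusion breakpoints, microhomology sequences and/or short template insertion sequences;
    a calculation and annotation step, including, for the results of the fusion breakpoint analysis step and the SR signal analysis step, calculating the mutation depth on the left and right sides of each fusion breakpoint and identifying the structural variation type, each result being annotated from the left and right breakpoints left_bp and right_bp and the alignment directions of the left and right assembled fragments;
    an annotation result merging and output step, including merging the annotation results of the calculation and annotation step so that overlapping records produced by dual recognition through the DP and SR signals are combined, and taking the merged result as the structural variation detection result of the subject under test.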
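  2. The method according to claim 1, characterized in that: in the data acquisition step, the alignment file is a bam file;
    preferably, insert size max is the insert size mean + 3.96 × the insert size standard deviation.
  3. The method according to claim 1, characterized in that: in the signal classification step, the set length is 75k.
  4. The method according to any one of claims 1-3, characterized in that: in the calculation and annotation step, the left and right sides of the fusion breakpoint are the left side of the left breakpoint and the right side of the right breakpoint; for each side, the number of DP signals, SR signals and SU signals contained in the consensus sequence is taken as the alt depth, the larger of the two depths is taken as the mutation depth, and the number of DP signals, SR signals, SU signals and normal reads in the corresponding interval is taken as the overall depth;
    preferably, in the calculation and annotation step, annotating each result specifically comprises determining the structural variation type from these two alignment directions and the relative positions of breakpoint 1 and breakpoint 2; if the left and right breakpoints are not on the same chromosome, the variant is an interchromosomal translocation, being a type 2 interchromosomal translocation if the left and right sequences have the same direction and a type 1 interchromosomal translocation otherwise; if the left and right breakpoints are on the same chromosome and the left and right sequences align in the same direction, the variant is a chromosomal inversion; if breakpoint 1 lies before breakpoint 2 and breakpoint 1 is aligned in reverse, or breakpoint 1 lies after breakpoint 2 and breakpoint 2 is aligned in reverse, the variant is a chromosomal deletion; all remaining cases are chromosomal duplications.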
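  5. A device for structural variation detection, characterized by comprising a data acquisition module, a signal classification module, a DP signal clustering module, a fusion breakpoint analysis module, an SR signal analysis module, a calculation and annotation module, and an annotation merging and output module;
    the data acquisition module being configured to obtain an alignment file of next-generation sequencing data of a subject to be tested and its basic information, the basic information comprising the insert size mean and standard deviation, insert size max, and read length;
    the signal classification module being configured to extract reads within intervals of a set length from the alignment file and to divide abnormal reads into DP signals, SR signals and SU signals; the DP signals being reads with insert size > insert size max or reads whose two paired reads map to two different chromosomes, the SR signals being soft-clipped reads, and the SU signals being reads of read pairs in which only one read aligns to the reference sequence;
    the DP signal clustering module being configured to cluster the DP signals obtained by the signal classification module, taking reads whose distance is within insert size max and whose orientation is the same as one DP signal cluster, each cluster serving as one structural variation candidate;
    the fusion breakpoint analysis module being configured to extract SR signals and SU signals within insert size max of each cluster obtained by the DP signal clustering module, assemble them together with the corresponding DP signals, and realign the assembly result to obtain fusion breakpoints, microhomology sequences and/or short templated insertion sequences;
    the SR signal analysis module being configured to search for chimeric alignments among the SR signals obtained by the signal classification module to obtain variants that contain no DP signal; to extract the corresponding DP signals and SU signals in the vicinity of the region where the variant occurs, namely within insert size on both sides of the SR signal interval; to add the reference sequence corresponding to the vicinity of that region, namely at least 10 bp on each side of the SR signal interval, for assembly; and to realign the assembly result to obtain fusion breakpoints, microhomology sequences and/or short templated insertion sequences;
    the calculation and annotation module being configured, for the results of the fusion breakpoint analysis module and the SR signal analysis module, to calculate the mutation depth on the left and right sides of the fusion breakpoint and to identify the structural variation type, each result being annotated from the two breakpoints left_bp and right_bp and the alignment directions of the left and right assembled segments;
    the annotation merging and output module being configured to merge the annotation results of the calculation and annotation module so as to combine the overlapping records that arise from identification through both DP signals and SR signals, and to take the merged result as the structural variation detection result of the subject to be tested.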
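  6. The device according to claim 5, characterized in that: in the data acquisition module, the alignment file is a bam file;
    preferably, insert size max is the insert size mean + 3.96 × the insert size standard deviation.
  7. The device according to claim 5, characterized in that: in the signal classification module, the set length is 75k.
  8. The device according to any one of claims 5-7, characterized in that: in the calculation and annotation module, the left and right sides of the fusion breakpoint are the left side of the left breakpoint and the right side of the right breakpoint; for each side, the number of DP signals, SR signals and SU signals contained in the consensus sequence is taken as the alt depth, the larger of the two depths is taken as the mutation depth, and the number of DP signals, SR signals, SU signals and normal reads in the corresponding interval is taken as the overall depth;
    preferably, in the calculation and annotation module, annotating each result specifically comprises determining the structural variation type from these two alignment directions and the relative positions of breakpoint 1 and breakpoint 2; if the left and right breakpoints are not on the same chromosome, the variant is an interchromosomal translocation, being a type 2 interchromosomal translocation if the left and right sequences have the same direction and a type 1 interchromosomal translocation otherwise; if the left and right breakpoints are on the same chromosome and the left and right sequences align in the same direction, the variant is a chromosomal inversion; if breakpoint 1 lies before breakpoint 2 and breakpoint 1 is aligned in reverse, or breakpoint 1 lies after breakpoint 2 and breakpoint 2 is aligned in reverse, the variant is a chromosomal deletion; all remaining cases are chromosomal duplications.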
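  9. A device for structural variation detection, characterized in that: the device comprises a memory and a processor;
    the memory being configured to store a program;
    the processor being configured to implement the method for structural variation detection according to any one of claims 1-4 by executing the program stored in the memory.
  10. A computer-readable storage medium, characterized in that: a program is stored in the storage medium, and the program can be executed by a processor to implement the method for structural variation detection according to any one of claims 1-4.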
PCT/CN2023/082917 2022-03-28 2023-03-21 Method, device and storage medium for structural variation detection WO2023185559A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210314220.6A CN114743594B (zh) 2022-03-28 2022-03-28 Method, device and storage medium for structural variation detection
CN202210314220.6 2022-03-28

Publications (1)

Publication Number Publication Date
WO2023185559A1 (zh) 2023-10-05

Family

ID=82277109

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/082917 WO2023185559A1 (zh) 2022-03-28 2023-03-21 Method, device and storage medium for structural variation detection

Country Status (2)

Country Link
CN (1) CN114743594B (zh)
WO (1) WO2023185559A1 (zh)

Also Published As

Publication number Publication date
CN114743594A (zh) 2022-07-12
CN114743594B (zh) 2023-04-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23777932

Country of ref document: EP

Kind code of ref document: A1