CN114464252A - Method and device for detecting structural variation

Info

Publication number: CN114464252A
Application number: CN202210093787.5A
Authority: CN
Inventors: 刘涛; 苏亚男; 何俊义; 吴永鑫; 方欢
Original assignee: Shenzhen Guiinga Medical Laboratory
Current assignee: Shenzhen Guiinga Medical Laboratory
Priority date: 2022-01-26
Filing date: 2022-01-26
Publication date: 2022-05-10
Anticipated expiration: 2042-01-26
Also published as: CN114464252B

Abstract

A method and an apparatus for detecting structural variation are provided. The method includes: a calculation step, comprising calculating the maximum insert length of the sequencing library of a sample to be detected; a signal extraction step, comprising extracting SU signals, DP signals and SR signals from the sequencing data of the sample to be detected; and a signal processing step, comprising processing the SR signal, the SU signal and the DP signal to obtain structural variation information. By extracting and processing these three signals, the invention supports the detection of structural variation in tumor genomes.

Description

Method and device for detecting structural variation

Technical Field

The invention relates to the field of bioinformatics, in particular to a method and a device for detecting structural variation.

Background

Structural variation (SV) of the genome refers to large sequence changes and position changes (50 bp or longer) in the genome, including insertions, deletions, duplications, inversions, translocations and the like. SVs are an important source of genomic variation and are associated with evolution, genetic diseases, tumors and the like.

The most common type of second-generation sequencing is paired-end (PE) sequencing, in which a fixed length is sequenced from the same template strand in each of the forward and reverse directions. The two resulting sequences (reads) in the output data are read1 and read2, which are mates of each other. Alignment software analyzes these raw sequences and identifies their positions on a reference genome. Under normal conditions the sequences match well, and the distance between the alignment positions of the two sequences does not exceed the length distribution of the template strand. Some alignments are abnormal, however: a sequence aligns in several parts to different positions of the reference genome (split-read mapping, SR); the end of a sequence aligns to another position of the reference genome or to no position at all (soft clip, SC); one of the two sequences aligns normally while the other does not align (single-end alignment, SU); the two sequences align to two different chromosomes or to positions that are very far apart (discordant pair, DP); or the coverage depth at the aligned position is abnormal (depth of coverage, DOC). These alignment patterns can be used as signals supporting the detection of structural variation.

Given these signals derived from the alignment of the raw reads, the problem is how to infer the possible structural variation events, including the chromosomes involved, the breakpoints, the directions and so on. Theoretical methods and software for detecting structural variation from second-generation sequencing data have been developed over nearly fifteen years to solve this problem. These software tools differ mainly in the types of signals extracted, the signal processing method, whether assembly (AS) is performed, and the like.

Disclosure of Invention

According to a first aspect, in an embodiment, there is provided a method of detecting structural variation, comprising:

a calculation step, comprising calculating the maximum insert length of the sequencing library of a sample to be detected;

a signal extraction step, which comprises extracting SU signals, DP signals and SR signals from the sequencing data of a sample to be detected;

and a signal processing step, including processing the SR signal, the SU signal and the DP signal to obtain structural variation information.

According to a second aspect, in an embodiment, there is provided an apparatus for detecting structural variation, comprising:

the calculation module is used for calculating the length of the maximum insert of the sequencing library of the sample to be detected;

the signal extraction module is used for extracting SU signals, DP signals and SR signals from sequencing data of a sample to be detected;

and the signal processing module is used for processing the SR signal, the SU signal and the DP signal to obtain structural variation information.

According to a third aspect, in an embodiment, there is provided an apparatus comprising:

a memory for storing a program;

a processor for implementing the method of the first aspect by executing the program stored by the memory.

According to a fourth aspect, in an embodiment, there is provided a computer readable storage medium having a program stored thereon, the program being executable by a processor to implement the method of the first aspect.

According to the method and the device for detecting structural variation of the embodiments, the detection of structural variation of tumor genomes can be supported by extracting and processing the three signals.

Drawings

Fig. 1 is a flowchart of detecting structural variation information according to embodiment 1.

Detailed Description

The present invention will be described in further detail with reference to the following embodiments, in which like elements in different embodiments are given like reference numbers. In the following description, numerous specific details are set forth to provide a better understanding of the present application. However, those skilled in the art will readily recognize that some of the features may be omitted, or replaced with other elements, materials or methods, in different instances. In some instances, certain operations related to the present application are not shown or described in this specification, so as not to obscure the core of the present application with unnecessary detail. A detailed description of these operations is not necessary for those skilled in the art, who can fully understand them from the description in the specification and the general knowledge in the art.

Furthermore, the features, operations or characteristics described in the specification may be combined in any suitable manner to form various embodiments. Also, the steps or actions in the method descriptions may be reordered or swapped, as will be apparent to one of ordinary skill in the art. Thus, the various sequences in the specification serve only to describe a particular embodiment clearly and do not imply a required order, unless it is otherwise stated that a certain order must be followed.

The numbering of the components as such, e.g., "first", "second", etc., is used herein only to distinguish the objects as described, and does not have any sequential or technical meaning.

Interpretation of terms

As used herein, structural variation (SV) refers to large sequence changes and position changes (50 bp or longer) in a genome, including insertions, deletions, duplications, inversions, translocations and the like. SVs are an important source of genomic variation and are associated with evolution, genetic diseases, tumors and the like.

In one embodiment, the method includes a calculation step, including calculating a maximum insert length, and a signal extraction step, including extracting an SU signal (single-end alignment information), a DP signal (discordant pair information) and an SR signal (split-read mapping information) from the sequencing data of a sample to be detected.

In one embodiment, the maximum insert length is calculated as follows: the mean and variance of the normally distributed insert lengths are calculated, and the sum of the mean and three times the variance is taken as the maximum insert length. This statistic is sample-specific.

In an embodiment, in the signal extraction step, the SU signal includes a sequence pair that satisfies the following conditions: the flag does not indicate a secondary or supplementary alignment; one of the two sequences is unaligned or its mate sequence is unaligned; and the unaligned sequence of the pair is not a duplicate.

In an embodiment, in the signal extraction step, when the SU signal is extracted, the sequence flagged UNMAP in each sequence pair meeting the conditions is written to a temporary file and sorted, wherein the sequence is named using information about its mate sequence, the information including the name of the aligned reference sequence, the start position, the end position and the alignment quality.

In one embodiment, in the signal extraction step, the DP signal includes a sequence pair that simultaneously satisfies the following conditions: the flag is none of SECONDARY, SUPPLEMENTARY, DUP, UNMAP or MUNMAP; and the two sequences align to two different chromosomes, or the template length corresponding to the two sequences is greater than the maximum insert length.

In one embodiment, in the signal extraction step, when the DP signal is extracted, the sequence pairs meeting the conditions are written to a temporary file and sorted, wherein each sequence is named using information about its mate sequence, the information including the name of the aligned reference sequence, the start position, the end position and the alignment quality.

In an embodiment, in the signal extraction step, the SR signal includes a sequence pair that does not satisfy the SU signal extraction conditions or the DP signal extraction conditions and satisfies the following condition: the head or tail of the CIGAR is of the soft clip type.

In an embodiment, in the signal extraction step, when the SR signal is extracted, the soft clip information of each sequence meeting the condition is written to a temporary text file, wherein the soft clip information includes the name of the aligned reference sequence and the alignment position, whether the soft clip is located at the head or the tail, the alignment direction, the soft clip sequence, the base quality of the soft clip and the alignment quality of the sequence.

In one embodiment, the sequencing data comprises sequencing data aligned to a reference genome.

In one embodiment, the reference genome comprises a human reference genome.

In one embodiment, the reference genome comprises at least a portion of an hg19 (also known as GRCh37) genome, an hs37d5 genome, a b37 genome, an hg18 genome, an hg17 genome, an hg16 genome, or an hg38 genome.

In one embodiment, the test sample comprises genomic DNA.

In one embodiment, the sequencing data comprises region capture sequencing data (also known as targeted capture sequencing data).

In an embodiment, the SR signal processing includes processing soft clip information, and specifically includes the following steps:

a merging step, including merging the soft clip information of a plurality of sequences and defining each merged result as a SOFT, which serves as the SR signal, wherein the SOFT name comprises the name of the aligned reference sequence, the alignment position, whether the SOFT is located at the head or tail of the sequence, the alignment direction, the SOFT count, the alignment quality and the SOFT serial number;

a realignment step, including realigning the SOFT obtained in the merging step to obtain realignment information;

and a structural variation information extraction step, including calculating preliminary information of two candidate breakpoints from the information of the sequence where the SOFT is located (the original alignment) and the SOFT realignment information: the information of the sequence where the SOFT is located is recorded as a first breakpoint, the SOFT realignment information is recorded as a second breakpoint, and structural variation information is determined from the information of the first breakpoint and the second breakpoint.

In one embodiment, the merging step includes constructing an ordered soft clip information set and using the qualifying soft clip information as independent SR signals.

In one embodiment, in the merging step, a preliminary sort is performed according to whether the soft clip is located at the head or the tail of the sequence and according to the aligned reference sequence and position, and soft clips that agree in head or tail location, aligned reference sequence name, alignment position and alignment direction are taken as one cluster to construct an ordered soft clip set.

In one embodiment, the clusters include soft clip clusters at the head of forward sequences (forward reads) and soft clip clusters at the tail of reverse sequences (reverse reads).

In one embodiment, for the soft clip cluster type at the head of forward sequences, a new soft clip set is constructed from the soft clips taken in reverse order, deduplicated with counts, and sorted.

In one embodiment, for the soft clip cluster type at the tail of reverse sequences, a new soft clip set is constructed from the soft clips taken in their original order, deduplicated with counts, and sorted.

In one embodiment, in the merging step, after the ordered soft clip set is constructed, independent SR signals are obtained as follows: each soft clip is compared with the next one; if the next soft clip is shorter or has no prefix relationship with the current one, the current soft clip and its count are taken as an independent SR signal; if the next soft clip is longer and the prefix relationship holds, the longest soft clip and its count are promoted to an independent SR signal.

In one embodiment, after the independent SR signals are obtained, short sequences that do not meet the condition are filtered out, and each record in the resulting file is defined as a SOFT.

In one embodiment, the short sequences that do not meet the condition include SR signals shorter than 20 bp.

In an embodiment, the realignment step includes performing single-end realignment of the SOFT obtained in the merging step to obtain realignment information.

In one embodiment, in the structural variation information extraction step, two candidate breakpoints are calculated from two pieces of information, namely the information of the sequence where the SOFT is located (the original alignment) and the SOFT realignment information, so as to determine structural variation information.

In one embodiment, in the structural variation information extraction step, alignment results whose flag is none of SECONDARY, SUPPLEMENTARY, DUP, UNMAP or MUNMAP are extracted from the SOFT realignment information, and the second breakpoint is calculated from them.

In one embodiment, the preliminary information of the two candidate breakpoints includes: the chromosome where each breakpoint is located, the breakpoint position, the strand direction (is_reverse) and fusion direction (is_gene) of the sequence where the breakpoint SOFT is located, and the strand direction (is_reverse) and fusion direction (is_gene) of the SOFT realignment.

In an embodiment, the strand direction and the fusion direction specifically include the following four cases:

1) a SOFT originating from the head of a forward sequence, after realignment, aligns in the forward direction to the left side of the second breakpoint;

2) a SOFT originating from the head of a forward sequence, after realignment, aligns as a reverse complement to the right side of the second breakpoint;

3) a SOFT originating from the tail of a reverse sequence, after realignment, aligns in the forward direction to the right side of the second breakpoint;

4) a SOFT originating from the tail of a reverse sequence, after realignment, aligns as a reverse complement to the left side of the second breakpoint.

In one embodiment, the details of the structural variation can also be determined; for example, the type of structural variation is determined from the chromosome where each breakpoint is located, the breakpoint position, and the strand direction of the SOFT alignment.

In one embodiment, in the structural variation information extraction step, the search intervals are determined as follows: the region extends 10 bp on the side where the SOFT is located and extends the maximum insert length minus 2 × the sequencing read length on the side opposite the SOFT realignment. The two regions are used as search intervals.

In one embodiment, in the structural variation information extraction step, qualifying DP signals are sought near the two breakpoints as supplementary evidence, and qualifying normal sequences are sought near the first breakpoint and used to calculate the structural variation frequency.

In one embodiment, in the structural variation information extraction step, if no qualifying DP support is found and the SOFT count is less than 2, the SR signal is too weak, that is, its supporting evidence is insufficient, and the SR signal can be filtered out.

In an embodiment, in the structural variation information extraction step, when a qualifying DP signal requires a strand direction match, the strand direction of the mate read of the sequence where the SOFT is located is used as the strand direction near the first breakpoint.

In one embodiment, in the structural variation information extraction step, the search interval for normal sequences is the region from the maximum insert length to the left of the first breakpoint up to the breakpoint.

In one embodiment, in the structural variation information extraction step, the qualifying normal sequences are determined as follows: aligned sequences (reads) whose flag is SECONDARY, SUPPLEMENTARY, DUP, UNMAP or MUNMAP are not counted, nor are sequences (reads) whose template length is greater than the maximum insert length or that align to different chromosomes; the support number of normal sequences is the number of sequences whose alignment start position falls within the search interval.

In an embodiment, SR signal processing further includes a summary step after the structural variation information extraction step: because several SOFTs may determine the same two breakpoints, such results are merged into one.

In one embodiment, a method of merging breakpoints includes: merging with the chromosomes where the two breakpoints are located, the breakpoint positions, the strand direction (is_reverse) and fusion direction (is_gene) of the sequence where the breakpoint SOFT is located, and the strand direction (is_reverse) and fusion direction (is_gene) of the SOFT realignment as indexes, and updating the SR signal support number.

In one embodiment, during SU signal processing, the extracted SU signals are realigned and then locally assembled to obtain structural variation information. Because few assembled sequences align back to the reference genome, few structural variations are detected from SU signal processing.

In one embodiment, the processing method of the DP signal is as follows. Forward sequences (forward reads) are usually located to the left of the possible structural variation breakpoint, and reverse sequences (reverse reads) are usually located to the right of it. Forward and reverse strands are distinguished on the basis of these two rules, and the sequences of the DP signal (already sorted by alignment start position) are clustered: adjacent sequences are compared, and if they align to the same chromosome and their alignment start positions differ by 200 bp or less, they belong to the same cluster; this completes the first clustering and determines a first candidate breakpoint. The result is then sorted again in a fixed order (the alignment position followed by the mate alignment position, sorted at the whole-genome level; sorting is needed because the mates may correspond to several clusters), and the mate part is clustered: adjacent sequences are compared, and if they align to the same chromosome and their alignment start positions differ by 200 bp or less, they belong to the same cluster; this completes the second clustering and determines a second candidate breakpoint. The DP signals paired in clusters near the two candidate breakpoints are thus preliminarily determined.

In an embodiment, in the processing method of the DP signal, there may be more than one second candidate breakpoint. If the number of sequences clustered in the first two rounds is greater than or equal to 5, a third clustering is performed: if several second candidate breakpoints differ by 200 bp or more, they are clustered separately, and the number of sequences is again required to be greater than or equal to 5, which yields the final DP signal.

In one embodiment, in the DP signal processing method, if the DP signal is marked during the SR signal processing and the SU signal processing, the DP signal is skipped during the DP signal processing.

In one embodiment, in the DP signal processing method, the support number of normal sequences is determined in the same way as for the SR signal, and the side with more sequences is used as the source of the first breakpoint.

In one embodiment, in the DP signal processing method, the type of structural variation is obtained after determining the chromosome, the breakpoint position, the strand direction, and the fusion direction of two breakpoints of the DP signal.

In one embodiment, after all signal processing is completed, structural variation information is obtained, and the structural variation information includes the following information: chromosome, position and strand orientation of two breakpoints, fusion direction, sequence count of support signals, count of normal sequences, and quality value information.

In one embodiment, the present invention can detect structural variation of genome, supporting the detection of structural variation of tumor genome.

In one embodiment, the invention is divided into two aspects of signal extraction and signal processing, including three types of signals of SR, SU and DP, and information of chromosomes, breakpoints, directions and the like of structural variation occurrence is determined based on the signals.

Example 1

Fig. 1 is a flowchart of detecting structural variation information according to embodiment 1, including the following steps:

1. Calculating the maximum insert length

The mean and variance were calculated for normally distributed insert lengths, and the sum of the mean and triple variance was recorded as the maximum insert length, which was sample-specific.
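The calculation in step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `max_insert_length` and its parameters are hypothetical, and the insert sizes are assumed to have been collected already from properly paired reads of one sample. The text specifies the mean plus three times the variance; whether the standard deviation was intended is not stated, so the sketch exposes both through the assumed `use_std` switch.

```python
from statistics import mean, pstdev, pvariance


def max_insert_length(insert_sizes, k=3.0, use_std=False):
    """Return mean + k * spread for the insert sizes of one sample.

    By default the spread is the variance, as written in the text.
    use_std=True switches to the standard deviation (an assumption,
    not something the text states).
    """
    mu = mean(insert_sizes)
    spread = pstdev(insert_sizes) if use_std else pvariance(insert_sizes)
    return mu + k * spread


# Hypothetical insert sizes (bp) for a single sample.
sizes = [290, 300, 310]
limit = max_insert_length(sizes)
```

Because the statistic is sample-specific, such a function would be called once per sample before any signal is extracted.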

2. In this embodiment, signals are extracted from an alignment file. Each row of the alignment file represents the alignment of one sequence (read) and includes, among other fields, the flag in the second column and the CIGAR in the sixth column. The extracted signals are defined as SR, SU and DP, respectively.

2.1 The SU signal extracted in this embodiment refers to a pair of reads that simultaneously satisfies the following conditions: the flag is not a secondary alignment (SECONDARY: because of multiple mapping, a read may align to several positions on the chromosomes, and the main alignment result is PRIMARY) or a supplementary alignment (SUPPLEMENTARY: because of chimeric alignment, most of a read aligns normally as the representative alignment and the remaining part aligns elsewhere); one of the two reads is unaligned (UNMAP) or its mate read is unaligned (MUNMAP); and the unaligned read of the pair is not a duplicate (DUP). From each pair that satisfies these conditions, the read flagged UNMAP is written to a temporary bam file and sorted. The read is named using information about its mate read, which includes the name of the aligned reference sequence, the start position, the end position and the alignment quality.

2.2 The DP signal extracted in this embodiment refers to a pair of reads that simultaneously satisfies the following conditions: the flag is none of SECONDARY, SUPPLEMENTARY, DUP, UNMAP or MUNMAP; and the two reads align to two different chromosomes, or the template length corresponding to the two reads is greater than the maximum insert length. The read pairs that satisfy these conditions are written to a temporary bam file and sorted. Each read is named using information about its mate read, which includes the name of the aligned reference sequence, the start position, the end position and the alignment quality.

2.3 The SR signal extracted in this embodiment refers to a pair of reads that does not satisfy the SU and DP extraction conditions and satisfies the following condition: the head or tail of the CIGAR is of the soft clip type. The soft clip information of the reads that satisfy this condition is written to a temporary text file. The soft clip information includes the name of the aligned reference sequence and the alignment position, whether the soft clip is located at the head or the tail, the alignment direction, the soft clip sequence, the base quality of the soft clip and the alignment quality of the read.
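The extraction rules in 2.1 to 2.3 can be sketched as a per-record classifier. This is a simplified sketch under stated assumptions: the function `classify` and its argument names are hypothetical, the flag bit values are those of the SAM specification, a pair with both reads unaligned is assumed to yield no signal, and the duplicate check for SU is applied only when the current record is the unaligned one, because the mate's DUP bit is not visible from a single record.

```python
import re

# SAM flag bits, as defined in the SAM specification.
UNMAP, MUNMAP = 0x4, 0x8
SECONDARY, DUP, SUPPLEMENTARY = 0x100, 0x400, 0x800

# A soft clip ("<n>S") at the head or tail of the CIGAR string.
SOFT_CLIP = re.compile(r"^\d+S|\d+S$")


def classify(flag, rname, rnext, tlen, cigar, max_insert):
    """Return 'SU', 'DP', 'SR' or None for one alignment record."""
    if flag & (SECONDARY | SUPPLEMENTARY):
        return None
    unmapped = bool(flag & UNMAP)
    mate_unmapped = bool(flag & MUNMAP)
    if unmapped and mate_unmapped:
        return None  # assumption: no aligned mate to anchor the pair
    if unmapped or mate_unmapped:
        # SU: the unaligned read must not be a duplicate.
        return None if (unmapped and flag & DUP) else "SU"
    if not flag & DUP:
        # In SAM, "=" in RNEXT means "same reference as RNAME".
        different_chrom = rnext not in ("=", rname)
        if different_chrom or abs(tlen) > max_insert:
            return "DP"
    # SR: neither SU nor DP, with a soft clip at either CIGAR end.
    if SOFT_CLIP.search(cigar):
        return "SR"
    return None
```

Checking SU first, then DP, then SR mirrors the text, which defines SR only for reads that fail the other two conditions.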

3. Processing soft clip information

3.1 Merging soft clip information to construct SR signals

The temporary text file containing the soft clip information is first sorted by whether the soft clip is located at the head or the tail and by the aligned reference sequence and position. Soft clips that agree in head or tail location, aligned reference sequence name, alignment position and alignment direction are taken as one cluster. The clusters fall into two types. For a soft clip cluster at the head of forward sequences (forward reads), a new soft clip set is constructed from the soft clips taken in reverse order, deduplicated with counts, and sorted. For a soft clip cluster at the tail of reverse sequences (reverse reads), a new soft clip set is constructed from the soft clips taken in their original order, deduplicated with counts, and sorted.

From the constructed ordered soft clip set, independent SR signals are obtained as follows. Each soft clip is compared with the next one. If the next soft clip is shorter or has no prefix relationship with the current one, the current soft clip and its count (1) are taken as an independent SR signal. If the next soft clip is longer and the prefix relationship holds, the longest soft clip and its count (the number of soft clips in the prefix relationship, greater than 1) are promoted to an independent SR signal. SR signals shorter than 20 bp are filtered out, and the rest are written to a temporary fastq file. Each record in the temporary fastq file is defined as a SOFT, to distinguish it from the original soft clips before merging. The SOFT name includes the name of the aligned reference sequence and the alignment position, whether the SOFT is located at the head or the tail, the alignment direction, the SOFT count, the alignment quality and the serial number of the SOFT. The process also records the counts and lengths of all SOFTs, and the serial number serves as the index of the SOFT for subsequent deduplication and merging.
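The prefix-based merging within one cluster can be sketched as below. The sketch rests on two assumptions that the text does not spell out: "reverse order" for head clips is read as reversing each clip sequence so that the base next to the breakpoint comes first, and the ordered set is produced by a plain lexicographic sort. The function name `merge_soft_clips` and its parameters are hypothetical.

```python
def merge_soft_clips(clips, head=False, min_len=20):
    """Merge the soft clip sequences of one cluster into (sequence, count) pairs.

    clips:   soft clip sequences that share reference name, position,
             direction and head/tail location
    head:    True for clips at the head of forward reads (reversed first)
    min_len: merged signals shorter than this are filtered out
    """
    seqs = sorted(c[::-1] if head else c for c in clips)
    merged = []
    current, count = None, 0
    for seq in seqs:
        if current is not None and seq.startswith(current):
            # Same or longer clip extending the current one: promote it.
            current, count = seq, count + 1
        else:
            if current is not None:
                merged.append((current, count))
            current, count = seq, 1
    if current is not None:
        merged.append((current, count))
    # Restore the original orientation and drop short signals.
    return [(s[::-1] if head else s, n) for s, n in merged if len(s) >= min_len]
```

After a lexicographic sort a clip is immediately followed by its extensions, so a single pass comparing each clip with the next one is enough to accumulate the counts.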

3.2 Single-end realignment of the SOFT in the temporary fastq file

3.3 Extracting detailed structural variation information from the realignment results

Alignment results whose flag is SECONDARY, SUPPLEMENTARY, DUP, UNMAP or MUNMAP are disregarded. Preliminary information of two candidate breakpoints can be calculated from two pieces of information: the information of the read where the SOFT is located (recorded as breakpoint1, the first breakpoint) and the SOFT realignment information (recorded as breakpoint2, the second breakpoint). This information comprises the chromosome of each breakpoint, the breakpoint position, and the strand direction (is_reverse) and fusion direction (is_gene) of both the read where the breakpoint SOFT is located and the SOFT realignment. Four cases can be distinguished (as shown in Table 1, where 0 indicates no and 1 indicates yes): a SOFT originating from the head (left side) of a forward read realigns either in the forward direction to the left side of breakpoint2 or as a reverse complement to the right side of breakpoint2; a SOFT originating from the tail (right side) of a reverse read realigns either in the forward direction to the right side of breakpoint2 or as a reverse complement to the left side of breakpoint2. From this, the details of the structural variation can be determined; for example, the type of structural variation can be determined from the chromosome where each breakpoint is located, the breakpoint position and the strand direction of the SOFT alignment.

TABLE 1 SOFT

The SOFT and its realignment information determine the SR signal of a structural variation. Qualifying DP signals are then sought as supplementary evidence in the regions near the two breakpoints (the search interval extends 10 bp on the side where the SOFT is located and extends the maximum insert length minus 2 × the sequencing read length on the side opposite the SOFT realignment). A DP read pair qualifies when the chromosomes and positions to which its two reads align are consistent with breakpoint1 and breakpoint2 and the strand directions are consistent (the strand direction of the mate read of the read where the SOFT is located is taken as the strand direction near breakpoint1). The quality value and support number of the DP signal are then determined, and DP support found in this way is not considered again in the later DP signal processing. If no DP support is found and the SOFT count is less than 2, the SR signal is filtered out.

Qualifying normal reads are sought in the region near breakpoint1 (the search interval runs from the maximum insert length on the left flank up to the breakpoint). Alignments whose flag is SECONDARY, SUPPLEMENTARY, DUP, UNMAP or MUNMAP are not counted, nor are reads whose template length is greater than the maximum insert length or that align to different chromosomes. The support number of normal reads is the number of reads whose alignment start position falls within the search interval.

3.4 Summary

Because several SOFTs may determine the same two breakpoints, such results need to be merged into one. Merging uses as indexes the chromosomes where the two breakpoints are located, the breakpoint positions, and the strand direction (is_reverse) and fusion direction (is_gene) of both the read where the breakpoint SOFT is located and the SOFT realignment; the SR signal support number is updated accordingly.
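The summary step amounts to grouping candidate calls by a composite key and summing their support. The following sketch assumes each candidate is a dictionary; the field names (`chrom1`, `pos1`, `sr_support` and so on) are hypothetical, with only `is_reverse` and `is_gene` taken from the text, and the function name `merge_calls` is illustrative.

```python
from collections import Counter

# Fields that together identify one pair of breakpoints (hypothetical names).
KEY_FIELDS = (
    "chrom1", "pos1", "is_reverse1", "is_gene1",
    "chrom2", "pos2", "is_reverse2", "is_gene2",
)


def merge_calls(calls):
    """Merge calls with identical breakpoint keys and sum their SR support."""
    support = Counter()
    for call in calls:
        key = tuple(call[field] for field in KEY_FIELDS)
        support[key] += call["sr_support"]
    return support
```

Two SOFTs that point to the same breakpoints in the same orientation then contribute to a single record whose support number is the sum of their counts.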

4. Processing of SU signals

The extracted SU signals are realigned and then locally assembled to obtain structural variation information. Since few of the assembled sequences align back to the reference genome, few structural variations are detected by SU signal processing.

5. Processing DP signals

The alignment pattern around a structural variation is as follows: forward reads are located to the left of the possible structural variation breakpoint and reverse reads are located to the right of it. On this basis, the reads of the DP signal (already sorted by alignment start position) are clustered separately for the forward and reverse strands. Adjacent reads are compared, and if they are aligned to the same chromosome and their alignment start positions differ by 200 bp or less, they are placed in the same cluster; this completes the first clustering and determines the first candidate breakpoint. The result is then re-sorted in a fixed order (by alignment position and mate alignment position at the whole-genome level; sorting is needed because the mates may correspond to several clusters) and the mate part is clustered: adjacent sequences are compared, and if they are aligned to the same chromosome and their alignment start positions differ by 200 bp or less, they are placed in the same cluster; this completes the second clustering and determines the second candidate breakpoint. In this way, the DP signals that pair up in clusters near the two breakpoints are preliminarily determined. If the number of sequences clustered in the first clustering is greater than or equal to 5, a third clustering is performed: if several second candidate breakpoints differ by more than 200 bp, they are split into separate clusters, and those whose sequence count is again greater than or equal to 5 are returned as the final DP signal.
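The adjacent-read comparison used in each clustering pass can be sketched as a single sweep over sorted positions. The function name and the `(chrom, start)` tuple representation are assumptions for illustration; the same helper could in principle be applied to mate positions for the second pass.

```python
def cluster_by_start(reads, max_gap=200):
    """Cluster reads given as (chrom, start) tuples, already sorted by chrom and start.

    A read joins the current cluster if it is on the same chromosome as the
    previous read and its start is at most max_gap bp further along.
    """
    clusters = []
    for chrom, start in reads:
        if clusters:
            prev_chrom, prev_start = clusters[-1][-1]
            if prev_chrom == chrom and start - prev_start <= max_gap:
                clusters[-1].append((chrom, start))
                continue
        clusters.append([(chrom, start)])  # start a new cluster
    return clusters


def filter_clusters(clusters, min_reads=5):
    """Keep clusters with at least min_reads sequences (threshold of 5 from the text)."""
    return [c for c in clusters if len(c) >= min_reads]
```

Because each read is compared only with its immediate predecessor, a cluster can span more than 200 bp in total as long as no adjacent gap exceeds the threshold.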

DP signals that have already been marked during SR or SU signal processing are skipped.

The support count of normal reads is determined (in the same way as for the SR signal), with the side having more support taken as the source of breakpoint 1.

The chromosome, the breakpoint position, the chain direction and the fusion direction of the two breakpoints of the DP signal are determined, and then the type of structural variation is determined.

6. Writing out the results

The structural variation information determined from the SR, SU and DP signals is written to the final result file, including the chromosomes, positions and strand directions of the two breakpoints, the fusion direction, the read count of the supporting signals, the count of normal reads, quality value information and the like.
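As an illustration of such a result file, the listed fields could be written as tab-separated text. The column names and the tab-separated layout are assumptions; the text does not specify the actual file format.

```python
import csv
import io

# Hypothetical column names covering the fields listed in the text
FIELDS = ["chrom1", "pos1", "strand1", "chrom2", "pos2", "strand2",
          "fusion_dir", "support_reads", "normal_reads", "quality"]


def write_results(records, handle):
    """Write structural variation records (dicts keyed by FIELDS) as TSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for rec in records:
        writer.writerow(rec)


# Usage with an in-memory buffer and made-up example values
buf = io.StringIO()
write_results([{"chrom1": "chr2", "pos1": 100, "strand1": "+",
                "chrom2": "chr5", "pos2": 900, "strand2": "-",
                "fusion_dir": 1, "support_reads": 5,
                "normal_reads": 40, "quality": 60}], buf)
print(buf.getvalue())
```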

For 1048 panel samples (focusing on a set of 1992 positive SVs in these samples), DNA was extracted, libraries were built and chip capture sequencing was performed. The sequencing data were aligned to the human reference genome hg19 and analyzed with the structural variation detection software of this embodiment; the detection rate reached 99.23%.

Those skilled in the art will appreciate that all or part of the functions of the various methods in the above embodiments may be implemented by hardware, or may be implemented by computer programs. When all or part of the functions of the above embodiments are implemented by a computer program, the program may be stored in a computer-readable storage medium, and the storage medium may include: a read only memory, a random access memory, a magnetic disk, an optical disk, a hard disk, etc., and the program is executed by a computer to realize the above functions. For example, the program may be stored in a memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above may be implemented. In addition, when all or part of the functions in the above embodiments are implemented by a computer program, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and may be downloaded or copied to a memory of a local device, or may be version-updated in a system of the local device, and when the program in the memory is executed by a processor, all or part of the functions in the above embodiments may be implemented.

The present invention has been described in terms of specific examples, which are provided to aid understanding of the invention and are not intended to be limiting. For a person skilled in the art to which the invention pertains, several simple deductions, modifications or substitutions may be made according to the idea of the invention.

Claims

1. A method for detecting structural variation, comprising:

2. The method of claim 1, wherein the maximum insert length is calculated as follows: calculating the mean value and the variance of the lengths of the insertion segments which accord with the normal distribution, and recording the sum of the mean value and the triple variance as the length of the maximum insertion segment;

preferably, in the signal extraction step, the SU signal includes a sequence pair that simultaneously satisfies the following condition: flag is not a secondary or supplemental alignment; one of the two sequences is not aligned or the mate sequence is not aligned, and the non-aligned sequences in the two sequences are not repeatedly aligned;

preferably, in the signal extraction step, when the SU signal is extracted, the sequence whose flag is UNMAP in each sequence pair meeting the condition is written out to a temporary file and sorted, wherein the sequence is named with the information of its mate sequence, which includes the name of the aligned reference genome, the start position, the end position and the alignment quality;

preferably, in the signal extracting step, the DP signal comprises a sequence pair satisfying the following conditions at the same time: flag is not SECONDARY, SUPPLEMENTARY, DUP, UNMAP or MUNMAP; the two sequences are aligned to two different chromosomes, or the template length corresponding to the two sequences is greater than the maximum insert length;

preferably, in the signal extraction step, when the DP signal is extracted, writing out the sequence pairs meeting the condition to a temporary file and sorting, wherein the names of the sequences are named by the information of their mate sequence, which includes the name, the start position, the end position, and the alignment quality of the reference genome on the alignment;

preferably, in the signal extraction step, the SR signal includes a sequence pair that does not satisfy the SU signal extraction condition and the DP signal extraction condition and satisfies the following conditions: the head or tail of the CIGAR is a soft clip type sequence;

preferably, in the signal extraction step, when the SR signal is extracted, soft clip information of a sequence meeting the conditions is written into the temporary text file, where the soft clip information includes the name of the aligned reference genome, the alignment position, whether the soft clip is located at the head or tail of the sequence, the alignment direction, the soft clip sequence, the base quality of the soft clip, and the sequence alignment quality;

preferably, the sequencing data comprises sequencing data aligned to a reference genome;

preferably, the reference genome comprises a human reference genome;

preferably, the reference genome comprises at least a portion of an hg19 genome, an hs37d5 genome, a b37 genome, an hg18 genome, an hg17 genome, an hg16 genome, or an hg38 genome;

preferably, the test sample comprises genomic DNA;

preferably, the sequencing data comprises region capture sequencing data.

3. The method as claimed in claim 1, wherein the SR signal processing includes processing soft clip information, and specifically includes the following steps:

merging step, including merging SOFT clip information on a plurality of sequences, defining the merged result as SOFT as SR signal, wherein the SOFT name comprises the name of the reference genome on the alignment, the alignment position, the head or tail of the sequence where the SOFT is positioned, the alignment direction, the SOFT count, the alignment quality and the SOFT serial number;

and a structural variation information extraction step, namely calculating the preliminary information of two candidate breakpoints according to the information of the sequence where the SOFT is located and the SOFT realignment information: recording the information of the sequence where the SOFT is located as a first breakpoint, recording the SOFT realignment information as a second breakpoint, and determining structural variation information according to the information of the first breakpoint and the second breakpoint.

4. The method as claimed in claim 3, wherein the step of merging includes constructing an ordered set of soft clip information, using the qualified soft clip information as an independent SR signal;

preferably, in the merging step, preliminary sorting is performed according to whether the soft clip is located at the head or tail of the sequence and according to the position information on the aligned reference genome, and soft clips that share the same head-or-tail location, aligned reference genome name, alignment position and alignment direction are taken as one cluster, so as to construct the ordered soft clip set;

preferably, the clusters comprise a soft clip cluster at the head of the forward sequence and a soft clip cluster at the tail of the reverse sequence;

preferably, for the soft clip cluster type at the head of the forward sequence, a new soft clip set is constructed by taking the soft clips in reverse order, deduplicating and counting them, and sorting them;

preferably, for the soft clip cluster type at the tail of the reverse sequence, a new soft clip set is constructed by taking the soft clips in their original order, deduplicating and counting them, and sorting them.

5. The method as claimed in claim 3, wherein in the merging step, after the ordered soft clip set is constructed, information meeting the following conditions is taken as an independent SR signal: comparing the front soft clip with the rear soft clip, and if the soft clip is shortened or has no prefix relation, taking the soft clip and the count thereof as independent SR signals; if the soft clip is lengthened and accords with the prefix relation, the longest soft clip and the count thereof are promoted to be used as independent SR signals;

preferably, after obtaining the independent SR signal, filtering to remove short sequences that do not meet the conditions, and defining each record in the obtained file as the SOFT;

preferably, the short sequences that do not meet the conditions comprise SR signal sequences shorter than 20 bp;

preferably, the realignment step comprises performing single-end realignment on the SOFT obtained in the merging step to obtain realignment information;

preferably, in the structural variation information extraction step, two candidate breakpoints are calculated according to the information of the sequence where the SOFT is located and the SOFT realignment information, so as to determine the structural variation information;

preferably, in the structural variation information extraction step, realignment results whose flag is not SECONDARY, SUPPLEMENTARY, DUP, UNMAP or MUNMAP are extracted from the SOFT realignment information, and the second breakpoint is calculated;

preferably, the preliminary information of the two candidate breakpoints includes the following information: the chromosome where the breakpoint is located, the breakpoint position, the strand direction and fusion direction of the sequence where the breakpoint SOFT is located, and the strand direction and fusion direction of the SOFT realignment;

preferably, the direction of the strand and the direction of the fusion specifically include the following four cases:

3) after the SOFT from the tail of the reverse sequence is realigned, it aligns in the forward direction to the right side of the second breakpoint;

6. The method according to claim 3, wherein in the structural variation information extraction step, a DP signal that meets the conditions is found in the vicinity of two breakpoints as a complement, and a normal sequence that meets the conditions is found in the vicinity of the first breakpoint for calculating the structural variation frequency;

preferably, in the structural variation information extraction step, the regions near the two breakpoints comprise: the region extended by 10 bp on the side where the SOFT is located, and the region extended, on the side opposite the SOFT realignment, by the maximum insert length minus 2 × the sequencing sequence length;

preferably, in the structural variation information extraction step, the DP signal that meets the conditions is determined as follows: when the aligned chromosomes and alignment positions corresponding to the two sequences of the DP type are consistent with the first breakpoint and the second breakpoint, and the strand directions are consistent, the quality value and the support count of the DP signal are determined; the DP support found is not considered during subsequent DP signal processing; if no DP support is found and the SOFT count is less than 2, the supporting evidence for the SR signal is insufficient and the SR signal is filtered out;

preferably, in the structural variation information extraction step, when consistency is judged for the qualifying DP signal, the strand direction of the mate sequence corresponding to the sequence where the SOFT is located is taken as the strand direction near the first breakpoint;

preferably, in the structural variation information extraction step, the region near the first breakpoint comprises: the region from the maximum insert length to the left of the first breakpoint up to the breakpoint;

preferably, in the structural variation information extraction step, the qualifying normal sequences are determined as follows: aligned sequences whose flag is SECONDARY, SUPPLEMENTARY, DUP, UNMAP or MUNMAP, sequences whose template length is greater than the maximum insert length, and sequences aligned to different chromosomes are not counted, and the support count of the normal sequences is determined according to the number of sequences whose alignment start position lies within the search interval.

7. The method according to claim 3, wherein during SR signal processing, further comprising a step of summarizing, specifically after the step of extracting structural variation information, if two breakpoints determined by a plurality of SOFTs are the same, merging into one breakpoint;

preferably, the method for merging breakpoints comprises the following steps: merging by taking the chromosome where the two breakpoints are located, the position of the breakpoints, the chain direction and the fusion direction of the sequence where the breakpoint SOFT is located and the chain direction and the fusion direction of SOFT realignment as indexes, and updating the SR signal support number.

8. The method of claim 1, wherein during SU signal processing, extracted SU signals are re-compared and then locally assembled to obtain structural variation information;

preferably, when processing the DP signal, the processing method of the DP signal includes: based on the fact that the forward sequence is located to the left of the possible structural variation breakpoint and the reverse sequence is located to the right of it, clustering the sequences of the DP signal separately for the forward and reverse strands: comparing adjacent sequences, and if they are aligned to the same chromosome and their alignment start positions differ by 200 bp or less, completing the first clustering and determining a first candidate breakpoint; re-sorting the result in a fixed order, clustering the mate part, comparing adjacent sequences, and if they are aligned to the same chromosome and their alignment start positions differ by 200 bp or less, completing the second clustering and determining a second candidate breakpoint; and preliminarily determining the DP signals paired into clusters near the two candidate breakpoints;

preferably, in the processing method of the DP signal, there may be more than one second candidate breakpoint; if the number of sequences clustered in the first two clusterings is greater than or equal to 5, a third clustering is performed, in which second candidate breakpoints that differ by 200 bp or more are split into separate clusters, and those whose sequence count is again greater than or equal to 5 are returned to obtain the final DP signal;

preferably, in the DP signal processing method, if the DP signal is already marked during the SR signal processing and the SU signal processing, the DP signal is skipped during the DP signal processing;

preferably, in the DP signal processing method, the support count of normal sequences is determined, and the side with more support is taken as the source of the first breakpoint;

preferably, in the DP signal processing method, after determining the chromosome, the breakpoint position, the strand direction, and the fusion direction where the two breakpoints of the DP signal are located, the type of structural variation is obtained;

preferably, after all signal processing is completed, structural variation information is obtained, and the structural variation information includes the following information: the chromosome, position and strand direction of the two breakpoints, the fusion direction, the sequence count of the support signal, the count of the normal sequence and the quality value information are written as the final result.

9. An apparatus for detecting structural variations, comprising:

10. An apparatus, comprising:

a memory for storing a program;

a processor for implementing the method of any one of claims 1 to 8 by executing a program stored by the memory.

11. A computer-readable storage medium, characterized in that the medium has stored thereon a program which is executable by a processor to implement the method according to any one of claims 1 to 8.