CN114464252B

CN114464252B - Method and device for detecting structural variation

Info

Publication number: CN114464252B
Application number: CN202210093787.5A
Authority: CN
Inventors: 刘涛; 苏亚男; 何俊义; 吴永鑫; 方欢
Original assignee: Shenzhen Guiinga Medical Laboratory
Current assignee: Shenzhen Guiinga Medical Laboratory
Priority date: 2022-01-26
Filing date: 2022-01-26
Publication date: 2023-06-27
Anticipated expiration: 2042-01-26
Also published as: CN114464252A

Abstract

A method and device for detecting structural variation, the method for detecting structural variation comprises: a calculation step, which comprises calculating the maximum insert length of a sequencing library of a sample to be detected; the signal extraction step comprises extracting SU signals, DP signals and SR signals from sequencing data of a sample to be detected; and the signal processing step comprises the step of processing the SR signal, the SU signal and the DP signal to obtain the structural variation information. The invention can support the detection of the structural variation of tumor genome by extracting and processing three signals.

Description

Method and device for detecting structural variation

Technical Field

The invention relates to the field of bioinformatics, in particular to a method and a device for detecting structural variation.

Background

Genomic structural variation (Structure Variations, SV) refers to sequence and position changes of a large length (50 bp or more) occurring on a genome, and includes several types of insertion (insertion), deletion (duplication), inversion (inversion), translocation (transformation), etc., which are important sources of genomic variation, and are related to evolution, genetic disease, tumor, etc.

The second generation sequencing technology is commonly used in a double end (PE) sequencing type, that is, two sequences (reads) corresponding to the same template strand (template) are respectively measured in fixed lengths, and are read1 and read2 (mate) respectively, and the original sequences can be identified by the comparison software analysis on the positions of the original sequences on the reference genome, and are well matched under normal conditions, and the comparison position difference of the two sequences does not exceed the distribution of the template strand length, but some comparison exists, such as: the alignment of sequences in several parts to different positions of the reference genome (cut alignment, split reads mapping, SR), alignment of sequence ends to other positions of the reference genome or to any position of the reference genome not aligned (soft clip, SC), normal alignment of one of the two sequences to the other not aligned (SU), alignment of the two sequences to two chromosomes or far apart in alignment (anomaly alignment, DP), sequence coverage depth anomalies at the alignment (coverage, depth of coverage, DOC), etc., can be used as signals to support detection of structural variation.

With these signals derived from the original reads alignment information, how to reverse the possible structural variation events, including the occurring chromosomes, breakpoints, orientations, etc., is a problem to be solved by the theoretical method and software application of detecting structural variation developed based on second generation sequencing for the last fifteen years, and the differences of these software are mainly expressed in: the type of the extracted signal, the processing method of the signal, whether to Assemble (AS) and the like.

Disclosure of Invention

According to a first aspect, in an embodiment, there is provided a method of detecting structural variation, comprising:

a calculation step, which comprises calculating the maximum insert length of a sequencing library of a sample to be detected;

the signal extraction step comprises extracting SU signals, DP signals and SR signals from sequencing data of a sample to be detected;

and the signal processing step comprises the step of processing the SR signal, the SU signal and the DP signal to obtain the structural variation information.

According to a second aspect, in one embodiment, there is provided an apparatus for detecting structural variations, comprising:

the calculation module is used for calculating the maximum insert length of the sequencing library of the sample to be detected;

the signal extraction module is used for extracting SU signals, DP signals and SR signals from sequencing data of the sample to be detected;

And the signal processing module is used for processing the SR signal, the SU signal and the DP signal to obtain the structural variation information.

According to a third aspect, in an embodiment, there is provided an apparatus comprising:

a memory for storing a program;

a processor for implementing the method of the first aspect by executing the program stored in the memory.

According to a fourth aspect, in an embodiment, a computer readable storage medium is provided, having stored thereon a program executable by a processor to implement the method of the first aspect.

According to the method and the device for detecting structural variation in the embodiment, the detection of the structural variation of the tumor genome can be supported by extracting and processing three signals.

Drawings

Fig. 1 is a flow chart of detecting structural variation information in embodiment 1.

Detailed Description

The present invention will be described in further detail with reference to the following specific embodiments. Wherein like elements in different embodiments are numbered alike in association. In the following embodiments, numerous specific details are set forth in order to provide a better understanding of the present application. However, one skilled in the art will readily recognize that some of the features may be omitted, or replaced by other elements, materials, or methods in different situations. In some instances, some operations associated with the present application have not been shown or described in the specification to avoid obscuring the core portions of the present application, and may not be necessary for a person skilled in the art to describe in detail the relevant operations based on the description herein and the general knowledge of one skilled in the art.

Furthermore, the described features, operations, or characteristics of the description may be combined in any suitable manner in various embodiments. Also, various steps or acts in the method descriptions may be interchanged or modified in a manner apparent to those of ordinary skill in the art. Thus, the various orders in the description are for clarity of description of only certain embodiments, and are not meant to be required, unless otherwise indicated, to be followed.

The numbering of the components itself, e.g. "first", "second", etc., is used herein merely to distinguish between the described objects and does not have any sequential or technical meaning.

Interpretation of the terms

As used herein, structural variation (Structure Variations, SV) refers to sequence and position changes of a large length (50 bp or more) occurring on a genome, and includes several types of insertion (insertion), deletion (duplication), inversion (translocation), translocation (transformation), etc., which are important sources of genome variation, and are related to evolution, genetic disease, tumor, etc.

A calculating step including calculating a maximum insert length;

a signal extraction step, which comprises extracting SU signals (single unmapped, SU, single alignment information), DP signals (DP, anomaly pair information), SR signals (split reads mappi ng, cut pair information) from sequencing data of a sample to be tested;

In one embodiment, the maximum insert length is calculated as follows: including calculating the mean and variance for insert lengths that meet a normal distribution, the sum of the mean and the three times variance being noted as the maximum insert length, the statistic being sample-specific.

In an embodiment, in the signal extraction step, the SU signal includes a sequence pair that satisfies the following condition simultaneously: flag is not a secondary alignment or a supplementary alignment; one of the two sequences is unaligned or its mate sequence is unaligned, and the unaligned sequences in the two sequences are not aligned repeatedly.

In an embodiment, in the signal extraction step, when SU signals are extracted, a sequence whose flag is un AP in the pair of sequences meeting the condition is written out to a temporary file and ordered, wherein the name of the sequence is named by the information of its mate sequence, which contains the name, start position, end position, and alignment quality of the reference genome on the alignment.

In an embodiment, in the signal extraction step, the DP signal includes a sequence pair that satisfies the following conditions simultaneously: the flag is not SECONDARY, SUPPLEMENTARY, DUP, UMAP or MUNMAP, and the length of the template chain corresponding to two different chromosomes or two sequences in the alignment is greater than the maximum insert length.

In one embodiment, in the signal extraction step, when DP signals are extracted, pairs of sequences meeting the conditions are written out to a temporary file and ordered, wherein the names of the sequences are named by the information of their mate sequences, which contains the names, start position, end position, alignment quality of the reference genome on the alignment.

In an embodiment, in the signal extraction step, the SR signal includes a sequence pair that does not satisfy SU signal extraction conditions and DP signal extraction conditions, and satisfies the following conditions: the head or tail of cigar is a soft clip type sequence.

In an embodiment, in the signal extraction step, when the SR signal is extracted, soft clip information of the sequence meeting the condition is written out to the temporary text file, wherein the soft clip information includes a name and an alignment position of a reference genome on which the soft clip is located at the head or tail, the alignment direction, the soft clip sequence, the soft clip base quality, and the sequence alignment quality.

In one embodiment, the sequencing data comprises sequencing data aligned to a reference genome.

In one embodiment, the reference genome comprises a human reference genome.

In one embodiment, the reference genome comprises at least a portion of an hg19 (also known as GRCh 37) genome, an hs37d5 genome, a b37 genome, an hg18 genome, an hg17 genome, an hg16 genome, or an hg38 genome.

In one embodiment, the test sample comprises genomic DNA.

In one embodiment, the sequencing data comprises region capture sequencing data (also known as targeted capture sequencing data).

In one embodiment, the SR signal processing includes processing of soft clip information, and specifically includes the following steps:

combining, namely combining SOFT clip information on a plurality of sequences, wherein the combined result is defined as SOFT, and the SOFT name comprises the name of a reference genome on alignment, the alignment position, the head or tail of the sequences, the alignment direction, the SOFT count, the alignment quality and the SOFT sequence number;

a heavy comparison step, comprising the step of carrying out heavy comparison on the SOFT obtained in the merging step to obtain heavy comparison information;

the structural variation information extraction step is used for calculating the preliminary information of two candidate break points according to the information (front: original comparison) of the sequence in which the SOFT is positioned and the SOFT re-comparison information (back): and marking the information of the sequence where the SOFT is positioned as a first breakpoint, marking the SOFT comparison information as a second breakpoint, and determining the structural variation information according to the information of the first breakpoint and the second breakpoint.

In one embodiment, the merging step includes constructing an ordered set of soft clip information, and using the qualified soft clip information as the independent SR signal.

In an embodiment, in the merging step, the position information of the reference genome on the alignment is initially ordered according to the position information of the reference genome on the head or tail of the sequence, and the name, the alignment position and the alignment direction of the reference genome on the head or tail of the sequence are all the same, so that an ordered soft clip set is constructed.

In one embodiment, the clusters include a soft clip cluster at the front of forward reads and a soft clip cluster at the rear of reverse reads.

In one embodiment, for the soft clip cluster type of the forward sequence header, a new soft clip set is constructed after the soft clips are in reverse order, de-duplicated count, and ordering.

In one embodiment, for the soft clip cluster type at the tail of the reverse sequence, a new soft clip set is constructed after soft clip sequence, de-duplication count and sequencing.

In one embodiment, in the merging step, after the ordered soft clip set is constructed, the information meeting the following conditions is used as an independent SR signal: comparing front and back soft clips, and if the soft clips are shortened or have no prefix relation, taking the soft clip and the count thereof as independent SR signals; if the soft clip is longer and matches the prefix, the longest soft clip and its count are lifted as independent SR signals.

In one embodiment, after obtaining the independent SR signal, the short sequences that do not meet the conditions are filtered out, and each record in the resulting file is defined as the SOFT.

In one embodiment, the unconditionally short sequence comprises an SR signal read of less than 20 bp.

In an embodiment, the step of comparing the weights includes performing single-ended weight comparison on the SOFT obtained in the step of combining to obtain the weight comparison information.

In an embodiment, in the step of extracting the structural variation information, two candidate break points are calculated according to two parts of the information (original alignment) of the sequence in which the SOFT is located and the SOFT re-alignment information, so as to determine the structural variation information.

In one embodiment, in the step of extracting structural variation information, the following sequences are extracted from the SOFT alignment information: the flag is not SECONDARY, SUPPLEMENTARY, DUP, UMAP or the comparison of MUNMAP, the second breakpoint is calculated.

In an embodiment, the preliminary information of the two candidate breakpoints includes the following information: the chromosome where the breakpoint is located, the breakpoint position, the strand direction of the sequence where the breakpoint SOFT is located (is_reverse), the fusion direction of the sequence where the breakpoint SOFT is located (is_gene), the strand direction of SOFT alignment (is_reverse), and the fusion direction of SOFT alignment (is_gene).

In one embodiment, the chain direction and the fusion direction specifically include the following four cases:

1) After SOFT realignment of forward sequence header source, forward alignment is to the left of the second breakpoint;

2) After SOFT alignment of the forward sequence header source, reverse complementation alignment is carried out to the right side of the second breakpoint;

3) After SOFT alignment of the reverse sequence tail source, forward alignment is carried out to the right side of the second breakpoint;

4) After a SOFT realignment of the reverse sequence tail origin, the reverse complement alignment is to the left of the second breakpoint.

In an embodiment, the details of the occurrence of the structural variation may also be determined, for example, the type of the structural variation may be determined by the chromosome in which the breakpoint is located, the breakpoint position, and the breakpoint SOFT alignment chain.

In one embodiment, the step of extracting structural variation information includes: the region 10bp extended on the side of the SOFT and the region of the SOFT weight versus the length of the sequencing sequence length of the side extended maximum insert length-2. The two regions are used as search intervals.

In an embodiment, in the step of extracting the structural variation information, the DP signal conforming to the condition is found in the area around the two break points as a supplement, and the normal sequence conforming to the condition is found in the area around the first break point, so as to calculate the structural variation frequency.

In an embodiment, in the step of extracting structural variation information, the DP signal satisfying the condition includes: the comparison chromosome and the comparison position corresponding to the two sequences of the DP type are consistent with the first breakpoint and the second breakpoint, the chain direction is consistent, the quality value and the support number of the DP signals are determined, the found DP support is not considered any more in the subsequent DP signal processing, if the DP support is not found and the SOFT count is less than 2, the SR signal is too weak, namely the SR signal support evidence is insufficient, and the SR signal can be filtered and removed.

In an embodiment, in the step of extracting the structural variation information, when the DP signal request chain direction meeting the condition is consistent, a mate reads chain direction corresponding to the sequence where the SOFT is located is used as a chain direction near the first breakpoint.

In an embodiment, in the step of extracting structural variation information, the area around the first breakpoint for searching for the normal sequence includes: the first break point extends the maximum insert length to the left to the region between the break points.

In an embodiment, in the step of extracting structural variation information, the normal sequence meeting the condition includes: the flag is not SECONDARY, SUPPLEMENTARY, DUP, UMAP or MUNMAP aligned sequences (reads), sequences (reads) with template strand length greater than the maximum insert length or aligned to different chromosomes are not calculated, and the number of sequences in the search interval at the initial position of alignment is used to determine the number of sequences supported by the normal sequence.

In an embodiment, the SR signal processing further includes a summarizing step, specifically, after the structural variation information extracting step, two break points determined by multiple SOFT may be identical, and need to be combined into one. If two breakpoints determined by multiple SOFT are the same, the breakpoints are combined into one breakpoint.

In one embodiment, a method of merging breakpoints includes: and merging the chromosome where the two breakpoints are located, the breakpoint position, the chain direction (is_reverse) and the fusion direction (is_gene) of the sequence where the breakpoint SOFT is located, and the chain direction (is_reverse) and the fusion direction (is_gene) of SOFT weight comparison by taking the chromosome as indexes, and updating the support number of SR signals.

In one embodiment, during SU signal processing, the extracted SU signals are re-aligned and then partially assembled to obtain structural variation information. Since there are fewer sequences back to the reference genome after assembly, there are fewer structural variations detected after SU signal processing.

In one embodiment, when processing a DP signal, the processing method of the DP signal includes: forward sequences (forward reads) are located to the left of the possible structural variation breakpoint (typically most of the forward sequence is located to the left of the possible structural variation breakpoint), reverse sequences (reverse reads) are located to the right of the possible structural variation breakpoint (typically most of the reverse sequence is located to the right of the possible structural variation breakpoint), and based on these two rules, the forward and reverse strands are distinguished, and the sequences of the DP signal (which have been ordered based on alignment initiation position) are clustered: comparing adjacent sequences, if the comparison is carried out on the same chromosome and the difference between the comparison initial positions is less than or equal to 200bp, completing first clustering, and determining a first candidate breakpoint at the moment; re-sequencing the results according to a fixed sequence (comparing the comparison position with the mate comparison position, sequencing according to the whole genome level, possibly clustering the mates corresponding to the mates, and sequencing), clustering the mates partially, comparing adjacent sequences, and if the comparison is carried out on the same chromosome and the comparison initial position is different by less than or equal to 200bp, completing second clustering, and determining a second candidate breakpoint at the moment; the DP signals paired into clusters near the two candidate break points are initially determined.

In an embodiment, in the processing method of the DP signal, there may be more than one second candidate breakpoint, if the number of sequences after clustering in the previous two times is greater than or equal to 5, clustering is performed for the third time, and if the difference between the second candidate breakpoints is greater than 200bp, the second candidate breakpoints are separated into clusters, and the number of sequences is greater than or equal to 5 again, so as to obtain the final DP signal.

In one embodiment, the processing method of the DP signal is that if the DP signal is marked during the SR signal processing and the SU signal processing, the DP signal is skipped during the DP signal processing.

In an embodiment, in the processing method of the DP signal, the number of supported normal sequences (same as the SR signal) is determined, and more sides are used as the source of the first break point (break point 1).

In one embodiment, in the processing method of the DP signal, the type of structural variation is obtained after the chromosome, the breakpoint position, the chain direction, and the fusion direction of the two breakpoints of the DP signal are determined.

In one embodiment, after all signal processing is completed, structural variation information is obtained, where the structural variation information includes the following information: the chromosome, the position and the chain direction where the two break points are located, the fusion direction, the sequence count of the supporting signals, the count of the normal sequences and the quality value information.

a memory for storing a program;

In one embodiment, the invention can detect structural variations of the genome, supporting detection of structural variations of the tumor genome.

In one embodiment, the invention is divided into two aspects of signal extraction and signal processing, comprising three types of signals of SR, SU and DP, and determining information of chromosome, breakpoint, direction and the like of structural variation based on the signals.

Example 1

Fig. 1 is a flow chart of detecting structural variation information in embodiment 1, which includes the following steps:

1. calculation of maximum insert Length

The mean and variance are calculated for insert lengths that fit a normal distribution, the sum of the mean and the three times variance is noted as the maximum insert length, and the statistic is sample specific.

2. The comparison file extraction signal is based on the comparison condition that each row in the comparison file represents each sequence (reads), the comparison file comprises second column information flag of the comparison file, fourth column information CIGAR of the comparison file and the like, and the signals are respectively defined as SR, SU and DP after extraction.

2.1 SU signals extracted in this embodiment refer to ready pairs that simultaneously satisfy the following conditions: the flag is not SECONDARY alignment (SECONDARY, one reads may be aligned to multiple positions on the chromosome, primary alignment is primary) or SUPPLEMENTARY alignment (SUPPLEMENTARY alignment, most of the alignment is normal to be representational, other part of the alignment is elsewhere due to the mosaic alignment), one flag of the two reads is un-aligned (UNMAP) or its materaads is not necessarily aligned (nmap), and the un-aligned sequence of the two reads is not repeated alignment (DUP), one reads with a qualified flag is UNMAP is written to a temporary bank file and ordered, wherein the names of the reads are named with information of its mate reads, which contains the name, starting position, ending position, alignment quality of the reference genome on the alignment.

2.2 the DP signal extracted in this embodiment refers to a ready pair that satisfies the following conditions simultaneously: the flag is not SECONDARY, SUPPLEMENTARY, DUP, UMAP or MUNMAP, two reads are aligned to have two different chromosomes or the template chain length corresponding to the two reads is greater than the maximum insert length, and the eligible reads pairs are written out to a temporary bam file and ordered, wherein the names of the reads are named with the information of the mate reads, which contains the name, start position, end position, alignment quality of the reference genome on the alignment.

2.3 the SR signal extracted in this example refers to a ready pair that does not satisfy the SU and DP extraction conditions, and satisfies the following conditions: the head or tail of the cigar is reads of the soft clip type, and soft clip information of the reads meeting the conditions is written into a temporary text file, wherein the soft clip information comprises the names and alignment positions of reference genome on the head or tail and alignment, alignment directions, soft clip sequences, soft clip base quality and reads alignment quality of the soft clip.

3. Processing of soft clip information

3.1 combining Soft clip information to construct SR Signal

The temporary text file containing the soft clip information is preliminarily ordered according to the position information of the reference genome of which the soft clip is positioned at the head or the tail and is aligned, the name, the alignment position and the alignment direction of the reference genome of which the soft clip is positioned at the head or the tail and is aligned are taken as a cluster, and the clusters can be divided into two types: a new soft clip set is constructed after the soft clip cluster type of the forward sequence (forward reads) header is counted and ordered according to the reverse order and the duplicate removal of the soft clip; and (3) constructing a new soft clip set according to the soft clip original sequence, the deduplication count and the sequencing of the soft clip cluster type at the tail of the reverse sequences (reverse reads).

The constructed ordered soft clip set can be used as an independent SR signal according to the following conditions: comparing the front and back soft clips, if the soft clip is shortened or has no prefix relation, taking the soft clip and the count (1) thereof as independent SR signals; if the soft clip is longer and accords with the prefix relation, the longest soft clip and the count thereof (the number of soft clips containing the prefix relation is more than 1) are lifted as independent SR signals. After filtering out the part smaller than 20bp of the SR signal, writing the part into a temporary fastq file, wherein each record in the temporary fastq file is defined as SOFT to distinguish the original SOFT clip before merging, wherein the SOFT name comprises the name and the comparison position of a reference genome on comparison, the SOFT is positioned at the head or tail, the comparison direction, the SOFT count, the comparison quality and the number of SOFT. The process also records the count and length results of all SOFT's, the index of which SOFT is used for the subsequent de-duplication combination.

3.2 SOFT of temporary fastq File Single-ended heavy alignment

3.3 extracting detailed structural variation information from the re-alignment results

Irrespective of the comparison result that the flag is SECONDARY, SUPPLEMENTARY, DUP, UMAP or MUNMAP, the preliminary information of two candidate break points can be calculated from the two information of the ready information (noted as break point1, i.e. the first break point) and the SOFT re-comparison information (noted as break point2, i.e. the second break point) where the SOFT is located: the chromosome where the breakpoint is located, the breakpoint position, the strand direction (is_reverse) and the fusion direction (is_gene) of the comparison of the reads where the breakpoint SOFT is located and the SOFT weight can be divided into four cases (as shown in table 1, 0 indicates no, 1 indicates yes): the SOFT weight of the source of forward reads header (left) is followed by forward alignment to the left of break point2 or reverse complement alignment to the right of break point2, the SOFT weight of the source of reverse reads tail (right) is followed by forward alignment to the right of break point2 or reverse complement alignment to the left of break point2, and further details of structural variation occurrence can be determined, such as the type of structural variation determined by the chromosome where the breakpoint is located, the breakpoint location, the breakpoint SOFT alignment chain orientation.

TABLE 1SOFT

The SOFT and the realignment information determine the SR signal of the structural variation, and the DP signal meeting the conditions is searched for as supplement in the region (10 bp extended on the side where the SOFT is located, 2 times sequencing the length of the ready and the maximum insert length extended on the side where the SOFT is aligned as a search interval) near the two break points: the aligned chromosomes and the aligned positions corresponding to the two reads of the DP type are consistent with the break point1 and the break point2, the chain direction is consistent (the matrix reads chain direction of the reads where the SOFT is located is used as the chain direction near the break point 1), the quality value and the support number of the DP signal are determined, the found DP support is not considered in the process of the DP signal at the back, and if the DP support is not found and the SOFT count is less than 2, the SR signal is filtered.

Finding normal reads meeting the condition in the area near break point1 (the maximum insert length to the left flank at the breakpoint to the breakpoint as a search interval): the number of reads that were aligned to different chromosomes or that were aligned to different chromosomes with a template chain length greater than the maximum insert length was determined based on the number of reads whose aligned start position was within the search interval without calculating flag as SECONDARY, SUPPLEMENTARY, DUP, UMAP or MUNMAP.

3.4 summary

Since the two break points determined by multiple SOFT may be identical, it is necessary to merge into one: and merging by taking the chromosome where the two break points are located, the breakpoint position, the strand direction (is_reverse) where the breakpoint SOFT is located and the SOFT are in heavy comparison, and the fusion direction (is_gene) as indexes, and updating the support number of SR signals.

4. Processing SU signals

And (5) comparing the extracted SU signals again, and then performing local assembly to obtain structural variation information. Since there are fewer sequences back to the reference genome after assembly, there are fewer structural variations detected after SU signal processing.

5. Processing of DP signals

The alignment around structural variation is: forward reads are to the left of the possible structural variation breakpoint and reverse reads are to the right of the possible structural variation breakpoint, based on which the reads of the forward and reverse links to DP signals (which have been ordered based on alignment start positions) are clustered: comparing adjacent reads, if the comparison is carried out on the same chromosome and the difference of the comparison initial positions is less than or equal to 200bp, completing first clustering, and determining a first candidate breakpoint at the moment; re-sequencing the results according to a fixed sequence (comparing the comparison position with the mate comparison position, sequencing according to the whole genome level, possibly clustering the mates corresponding to the mates, and sequencing), clustering the mates partially, comparing adjacent sequences, and if the comparison is carried out on the same chromosome and the comparison initial position is different by less than or equal to 200bp, completing second clustering, and determining a second candidate breakpoint at the moment; in this way, the paired clustered DP signals near the two break points are initially determined. If the number of sequences after the previous two times of clustering is more than or equal to 5, performing third clustering, if the difference of a plurality of second candidate breakpoints is more than 200bp, dividing the second candidate breakpoints into clusters, and returning the number of sequences again to be more than or equal to 5, thereby obtaining a final DP signal.

The DP will also be skipped if the SR and SU signal processing is marked.

The number of normal reads supported (same as SR signal) is determined, and more sides are used as the source of break point 1.

The chromosome, breakpoint position, chain direction and fusion direction of two breakpoints of the DP signal are determined, so that the type of structural variation is determined.

6. Result write out

And writing the information of structural variation determined based on SR, SU and DP signals, including the chromosome, position and chain direction where the two break points are located, fusion direction, reads count of supporting signals, count of normal reads, quality value information and the like, into a final result file.

Taking 1048 cases of panel samples (focusing on 1992 positive SV sets), extracting DNA library, performing chip capture test, comparing sequencing data with human reference genome hg19, and analyzing by using the result mutation comparison software of the embodiment, wherein the detection rate reaches 99.23%.

Those skilled in the art will appreciate that all or part of the functions of the various methods in the above embodiments may be implemented by hardware, or may be implemented by a computer program. When all or part of the functions in the above embodiments are implemented by means of a computer program, the program may be stored in a computer readable storage medium, and the storage medium may include: read-only memory, random access memory, magnetic disk, optical disk, hard disk, etc., and the program is executed by a computer to realize the above-mentioned functions. For example, the program is stored in the memory of the device, and when the program in the memory is executed by the processor, all or part of the functions described above can be realized. In addition, when all or part of the functions in the above embodiments are implemented by means of a computer program, the program may be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a removable hard disk, and the program in the above embodiments may be implemented by downloading or copying the program into a memory of a local device or updating a version of a system of the local device, and when the program in the memory is executed by a processor.

The foregoing description of the invention has been presented for purposes of illustration and description, and is not intended to be limiting. Several simple deductions, modifications or substitutions may also be made by a person skilled in the art to which the invention pertains, based on the idea of the invention.

Claims

1. A method of detecting structural variation, comprising:

in the signal extraction step, the SU signal includes a sequence pair that satisfies the following conditions simultaneously: flag is not a secondary alignment or a supplementary alignment; one of the two sequences is unaligned or the mate sequence thereof is unaligned, and the unaligned sequences in the two sequences are not aligned repeatedly;

in the signal extraction step, the DP signal includes a sequence pair satisfying the following conditions simultaneously: flag is not SECONDARY, SUPPLEMENTARY, DUP, UMAP or MUNMAP, the length of the template chain corresponding to two different chromosomes or two sequences in the alignment is greater than the maximum insert length;

In the signal extraction step, the SR signal includes a sequence pair that does not satisfy SU signal extraction conditions and DP signal extraction conditions, and satisfies the following conditions: the head or tail of CIGAR is a soft clip type sequence;

a signal processing step, which comprises the steps of processing SR signals, SU signals and DP signals to obtain structural variation information;

when the DP signal is processed, the processing method of the DP signal comprises the following steps: according to the fact that the forward sequence is located on the left side of the possible structural variation breakpoint and the reverse sequence is located on the right side of the possible structural variation breakpoint, the forward chain and the reverse chain are distinguished, and the sequences of the DP signals are clustered: comparing adjacent sequences, if the comparison is carried out on the same chromosome and the difference between the comparison initial positions is less than or equal to 200bp, completing first clustering, and determining a first candidate breakpoint at the moment; sequencing the results again according to a fixed sequence, clustering the mate part, comparing adjacent sequences, and if the comparison is carried out on the same chromosome and the comparison initial position is different by less than or equal to 200bp, completing second clustering, and determining a second candidate breakpoint; the DP signals paired into clusters near the two candidate break points are preliminarily determined;

in the DP signal processing method, more than one second candidate breakpoint may be used, if the number of sequences after the previous two clustering is more than or equal to 5, the third clustering is performed, and if the difference of a plurality of second candidate breakpoints is more than 200bp, the second candidate breakpoints are separated into clusters, and the number of sequences is more than or equal to 5 again, so that the final DP signal is obtained.

2. The method of claim 1, wherein the maximum insert length is calculated as follows: includes calculating a mean and variance for insert lengths that meet a normal distribution, the sum of the mean and the three times variance being noted as the maximum insert length.

3. The method of claim 1, wherein in the signal extraction step, when SU signals are extracted, a sequence whose flag is UNMAP in the pair of sequences that meets the condition is written to a temporary file and ordered, wherein the name of the sequence is named by information of its mate sequence, which information contains the name, start position, end position, alignment quality of the reference genome on the alignment.

4. The method of claim 1, wherein in the signal extracting step, when extracting the DP signal, pairs of sequences meeting the conditions are written to the temporary file and ordered, wherein the names of the sequences are named by information of their mate sequences, which information contains the names, start position, end position, alignment quality of the reference genome on the alignment.

5. The method of claim 1, wherein in the signal extraction step, when the SR signal is extracted, soft clip information of a sequence conforming to the condition is written to the temporary text file, wherein the soft clip information includes a name and an alignment position of a reference genome on which the soft clip is located at a header or a trailer, an alignment direction, a soft clip sequence, a soft clip base quality, and a sequence alignment quality.

6. The method of claim 1, wherein the sequencing data comprises sequencing data aligned to a reference genome.

7. The method of claim 6, wherein the reference genome comprises a human reference genome.

8. The method of claim 7, wherein the reference genome comprises at least one of an hg19 genome, an hs37d5 genome, a b37 genome, an hg18 genome, an hg17 genome, an hg16 genome, or an hg38 genome.

9. The method of claim 1, wherein the test sample comprises genomic DNA.

10. The method of claim 1, wherein the sequencing data comprises region capture sequencing data.

11. The method of claim 1, wherein the SR signal processing includes processing soft clip information, and specifically includes the steps of:

the structural variation information extraction step is used for calculating the preliminary information of two candidate break points according to the information of the sequence where the SOFT is located and the SOFT comparison information: and marking the information of the sequence where the SOFT is positioned as a first breakpoint, marking the SOFT comparison information as a second breakpoint, and determining the structural variation information according to the information of the first breakpoint and the second breakpoint.

12. The method of claim 11, wherein the merging step includes constructing an ordered set of soft clip information, and using the qualified soft clip information as the independent SR signal.

13. The method of claim 12 wherein in the merging step, the primary ordering is performed according to the position information of the reference genome on the head or tail of the sequence and the alignment of the information of the soft clips, and the names, the alignment positions and the alignment directions of the reference genome on the head or tail and the alignment of the soft clips are all the same as one cluster, so that an ordered soft clip set is constructed.

14. The method of claim 13, wherein the clusters comprise a soft clip cluster for a forward sequence header and a soft clip cluster for a reverse sequence trailer.

15. The method of claim 14, wherein for a soft clip cluster type of the forward sequence header, a new set of soft clips is constructed after reverse order, de-duplication count, and ordering of soft clips.

16. The method of claim 14, wherein for a soft clip cluster type at the tail of the reverse sequence, a new set of soft clips is constructed after soft clip primordial, de-duplication count, and sequencing.

17. The method of claim 11, wherein in the merging step, after the ordered soft clip set is constructed, information meeting the following conditions is used as the independent SR signal: comparing front and back soft clips, and if the soft clips are shortened or have no prefix relation, taking the soft clips and the counts thereof as independent SR signals; if the soft clip is longer and matches the prefix, the longest soft clip and its count are lifted as independent SR signals.

18. The method of claim 17, wherein after obtaining the independent SR signal, filtering to remove unconditionally short sequences, defining each record in the resulting file as the SOFT.

19. The method of claim 18, wherein the unconditionally short sequence comprises an SR signal reading of less than 20 bp.

20. The method of claim 17, wherein the step of re-comparing includes performing single-ended re-comparison on the SOFT obtained in the step of combining to obtain re-comparison information.

21. The method of claim 20 wherein in the step of extracting structural variation information, two candidate break points are calculated according to the information of the sequence in which the SOFT is located and the SOFT re-alignment information, so as to determine the structural variation information.

22. The method of claim 21, wherein in the structural variation information extracting step, the following sequences are extracted from SOFT alignment information: the flag is not SECONDARY, SUPPLEMENTARY, DUP, UMAP or the comparison of MUNMAP, the second breakpoint is calculated.

23. The method of claim 21, wherein the preliminary information of the two candidate breakpoints includes information as follows: the chromosome where the breakpoint is located, the breakpoint position, the chain direction of the sequence where the breakpoint SOFT is located, the fusion direction of the sequence where the breakpoint SOFT is located, the chain direction of SOFT weight comparison and the fusion direction of SOFT weight comparison.

24. The method of claim 23, wherein the chain direction and the fusion direction specifically comprise the following four cases:

25. The method of claim 11, wherein in the structural variation information extracting step, a qualified DP signal is found in a region near two break points as a supplement, and a qualified normal sequence is found in a region near a first break point for calculating a structural variation frequency.

26. The method of claim 25, wherein in the structure variation information extracting step, the searching for the two break point vicinity areas used for the DP signal includes: the region 10bp extended on the side of the SOFT and the region of the SOFT weight versus the length of the sequencing sequence length of the side extended maximum insert length-2.

27. The method of claim 25, wherein in the structural variation information extracting step, the eligible DP signal comprises: and the comparison chromosomes and the comparison positions corresponding to the two sequences of the DP type are consistent with the first breakpoint and the second breakpoint, the chains are consistent, the quality value and the support number of the DP signals are determined, the found DP signal support is not considered any more in the subsequent DP signal processing, if the DP signal support is not found and the SOFT count is less than 2, the defect of insufficient support evidence of the SR signal is indicated, and the SR signal is filtered and removed.

28. The method of claim 27, wherein in the step of extracting structural variation information, when the DP signal meeting the condition requires a match determination, a match sequence corresponding to a sequence in which a SOFT is located is used as a first break point nearby chain.

29. The method of claim 25, wherein the step of extracting structural variation information includes: the first break point extends the maximum insert length to the left to the region between the break points.

30. The method of claim 25, wherein in the step of extracting structural variation information, the normal sequence satisfying the condition includes: the flag is not SECONDARY, SUPPLEMENTARY, DUP, UMAP or MUNMAP, sequences with template strand length longer than the maximum insert length or aligned to different chromosomes are not calculated, and the number of sequences in the search interval at the initial position of alignment is used for determining the number of supports of the normal sequence.

31. The method of claim 11 further comprising a step of summarizing the SR signal, particularly after the step of extracting structural variation information, if two break points determined by the plurality of SOFT are identical, the two break points are merged into one break point.

32. The method of claim 31, wherein the method of merging breakpoints comprises: and merging the indexes of the chromosome where the two break points are located, the breakpoint position, the chain direction and the fusion direction of the sequence where the breakpoint SOFT is located and the chain direction and the fusion direction of SOFT weight comparison, and updating the support number of the SR signals.

33. The method of claim 1, wherein the extracted SU signals are partially assembled after being re-aligned during SU signal processing to obtain structural variation information.

34. The method of claim 1, wherein in the processing method of the DP signal, if the DP signal is marked during the SR signal processing and the SU signal processing, the DP signal is skipped during the DP signal processing.

35. The method of claim 34 wherein in the processing method of the DP signal, the number of supported normal sequences is determined, and more sides are used as the source of the first breakpoint;

the normal sequences that meet the conditions include: the flag is not SECONDARY, SUPPLEMENTARY, DUP, UMAP or MUNMAP, sequences with template strand length longer than the maximum insert length or aligned to different chromosomes are not calculated, and the number of sequences in the search interval at the initial position of alignment is used for determining the number of supports of the normal sequence.

36. The method of claim 34 wherein the DP signal is processed by determining the chromosome, breakpoint location, orientation, and fusion direction of the two breakpoints of the DP signal, and obtaining the type of structural variation.

37. The method of claim 34, wherein after all signal processing is completed, structural variation information is obtained, the structural variation information comprising: the chromosome, the position and the chain direction where the two break points are located, the fusion direction, the sequence count of the supporting signals, the count of the normal sequence and the quality value information are written out as final results.

38. An apparatus for detecting structural variations, comprising:

the signal processing module is used for processing the SR signal, the SU signal and the DP signal to obtain structural variation information;

39. An apparatus, comprising:

a memory for storing a program;

a processor configured to implement the method according to any one of claims 1 to 37 by executing a program stored in the memory.

40. A computer readable storage medium having stored thereon a program executable by a processor to implement the method of any one of claims 1 to 37.