CN111081318A

CN111081318A - Fusion gene detection method, system and medium

Info

Publication number: CN111081318A
Application number: CN201911243763.8A
Authority: CN
Inventors: 曾华萍; 传军; 吴桂枝; 王益民
Original assignee: Genetalks Bio Tech Changsha Co ltd
Current assignee: Genetalks Bio Tech Changsha Co ltd
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2020-04-28
Anticipated expiration: 2039-12-06
Also published as: CN111081318B

Abstract

The invention discloses a fusion gene detection method, system and medium. The method comprises: determining an input file; filtering each sequencing read of the input file, acquiring breakpoint information and breakpoint relations, and carrying out frequency statistics; for each breakpoint found, determining the fusion kmer sequence and the reference kmer sequence of the breakpoint, judging whether each sequencing read covering the breakpoint is a fusion read or a reference read, and counting them; and judging the fusion type and the genes of each pair of paired breakpoints, and outputting the fusion gene detection result for each pair. The invention greatly reduces the amount of data to be processed and improves the efficiency of fusion gene detection. It fully accounts for overlap in the fusion pattern, so the fusion pattern is determined more clearly, and by precisely analyzing how split reads are aligned it achieves accurate fusion gene detection, with the advantages of high speed, accurate frequency estimation and high sensitivity.

Description

Fusion gene detection method, system and medium

Technical Field

The invention relates to a gene detection technology, in particular to a fusion gene detection method, a fusion gene detection system and a fusion gene detection medium.

Background

Gene fusion is an important type of genomic structural variation, and next-generation sequencing, including WGS (whole genome sequencing), WES (whole exome sequencing) and targeted region capture, is widely used to detect fusion genes.

Fusion genes are usually detected from discordant reads and split reads. Split reads are the more important of the two, because they determine the location of the breakpoint as well as its depth and frequency. Most fusion detection software therefore bases its calls on split reads. In practice, however, most current fusion detection software shows the following problems:

1. Split reads that do not arise from gene fusion can make up more than 90% of the split reads in a bam file. Analyzing these split reads one by one for fusion genes consumes a large amount of computing resources and run time, so most existing fusion gene detection software runs too slowly.

2. Existing fusion gene detection methods do not consider the overlap relation between breakpoints when detecting fusion genes, so the fusion pattern is not determined clearly.

Moreover, in use, existing fusion gene detection methods are commonly inaccurate or slow, and sometimes fail to detect a fusion at all.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: in view of the above problems in the prior art, the invention provides a fusion gene detection method, system and medium. By further filtering based on the type of soft clip (softclip), the method can avoid more than 90% of the data processing and computation that is irrelevant to fusion gene detection, which greatly reduces the amount of data processed and improves the efficiency of fusion gene detection. The method also characterizes in detail the size of the overlap or spacing between paired breakpoints, fully accounts for the specific pattern of gene fusion, and therefore determines the fusion pattern more clearly.

In order to solve the technical problems, the invention adopts the technical scheme that:

A fusion gene detection method comprises the following steps:

1) determining an input file, wherein the input file is a bam file, a sam file or a cram file;

2) filtering each sequencing read of the input file, then acquiring breakpoint information and breakpoint relations and performing frequency statistics, wherein the breakpoint information comprises the breakpoint positions and the fusion kmer sequences at the breakpoints, and the breakpoint relation comprises the size of the overlap or spacing between paired breakpoints;

3) for each breakpoint found, determining the fusion kmer sequence and the reference kmer sequence of the breakpoint, judging whether each sequencing read covering the breakpoint is a fusion read or a reference read, and counting them;

4) judging the fusion type and the genes of each pair of paired breakpoints, and outputting the fusion gene detection result for each pair.

Optionally, the filtering in step 2) includes preliminary filtering, which means, for each sequencing read or pair of sequencing reads: filtering out alignment records whose cigar field contains no soft clip (softclip), filtering out records whose mapping quality is below a set threshold, and filtering out duplicate or supplementary alignment records, wherein the soft clip, mapping quality, and duplicate or supplementary status are all taken from the information in the input file.

Optionally, the filtering in step 2) includes further filtering, which means: for each sequencing read or pair of sequencing reads, preliminarily judging the cause of the soft clip in the cigar field, and if the cause is read-through sequencing (sequencing through the insert) or an end sequencing error, filtering out and deleting that sequencing read or pair of sequencing reads.

Optionally, the cause of the soft clip is read-through when the following three conditions are satisfied simultaneously: (i) both ends of the read pair align to the same chromosome and the insert size equals the matched length; (ii) the read matches at its head and is clipped at its tail end; (iii) the head of the other sequencing read in the pair also matches. The cause of the soft clip is an end sequencing error when the following three conditions are satisfied simultaneously: (i) both ends of the read pair align to the same chromosome and the insert size is smaller than a set value; (ii) the read matches at its head and does not match at its tail; (iii) the average quality value of the unmatched part is lower than a set value.

Optionally, the detailed steps of obtaining the breakpoint information and the breakpoint relation in step 2) include: acquiring the position of each breakpoint, the alignment direction of the breakpoint and the fusion kmer sequence at the breakpoint; counting the frequency of the fusion kmer sequences at each breakpoint; obtaining paired breakpoints from all alignment regions of each sequencing read; and determining the size of the overlap or spacing between the paired breakpoints as the breakpoint relation.

Optionally, the detailed step of determining the fusion kmer sequence of the breakpoint in step 3) includes: according to the frequencies of the fusion kmer sequences obtained by the frequency statistics in step 2), determining the 1 or 2 fusion kmer sequences with significantly high frequency as the final fusion kmer sequences. The detailed step of determining the reference kmer sequence of the breakpoint in step 3) includes: first obtaining all sequencing reads that cover the breakpoint and whose cigar field contains no soft clip, then obtaining the reference kmer sequence of each such read at the breakpoint, carrying out frequency statistics, and determining the 1 or 2 reference kmer sequences with significantly high frequency as the final reference kmer sequences.

Optionally, the detailed steps in step 3) of judging whether a sequencing read covering the breakpoint is a fusion read or a reference read and counting them include: first, initializing the fusion read count and the reference read count of the breakpoint; then traversing all sequencing reads covering the breakpoint and, for each read, calculating the similarity between the read's sequence at the breakpoint and the fusion kmer sequence and the reference kmer sequence respectively; if the read is more similar to the fusion kmer sequence, judging it to be a fusion read and increasing the fusion read count by 1, and if it is more similar to the reference kmer sequence, judging it to be a reference read and increasing the reference read count by 1; and finally, after the traversal, obtaining the fusion read count and the reference read count of the breakpoint.

Optionally, in step 4), the fusion type of each pair of paired breakpoints is judged to be one of four types, namely deletion, duplication, inversion and translocation, and the fusion gene detection result of each pair is output. The result includes the positions of the paired breakpoints, the genes in which they lie, the fusion type, the fusion depth, the total depth, the self kmer sequence of each breakpoint, the fusion kmer sequence of each breakpoint, and the distance between the paired breakpoints, where the fusion depth is the fusion read count of the breakpoint, the total depth is the sum of the fusion read count and the reference read count of the breakpoint, and the distance between the paired breakpoints is the size of the overlap or spacing between them.

In addition, the present invention also provides a fusion gene detection system comprising a computer device, wherein the computer device is programmed or configured to execute the steps of the fusion gene detection method, or a storage medium of the computer device stores a computer program programmed or configured to execute the fusion gene detection method.

In addition, the present invention also provides a computer-readable storage medium having stored thereon a computer program programmed or configured to execute the fusion gene detection method.

Compared with the prior art, the invention has the following advantages:

1. The invention characterizes in detail the size of the overlap or spacing between paired breakpoints and fully accounts for the specific pattern of gene fusion, so the fusion pattern is determined more clearly.

2. The method determines the fusion kmer sequence and the reference kmer sequence of each breakpoint, and judges whether each sequencing read covering the breakpoint is a fusion read or a reference read according to its similarity to the fusion kmer sequence and the reference kmer sequence, thereby achieving accurate detection of gene fusion with the advantages of high speed, accurate frequency estimation and high sensitivity.

Drawings

FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.

FIG. 2 is a detailed flow chart of the method according to the embodiment of the present invention.

FIG. 3 is a schematic diagram of a pair of breakpoints (breakpoint 1 and breakpoint 2) in the embodiment of the present invention.

FIG. 4 is a schematic diagram of the overlap between paired breakpoints according to an embodiment of the present invention.

Detailed Description

The fusion gene detection method, system and medium of the present invention are described in further detail below, taking a bam file as an example. The bam file is the most common storage format for alignment data in current genetic data analysis; it is suitable for both short and long sequencing reads and supports reads of up to 128 Mbp. Besides bam files (file suffix .bam), cram files (file suffix .cram) are a highly compressed form of bam files whose IO efficiency is slightly worse than that of the original bam files, and sam files (file suffix .sam) are the plain-text form of bam files. Because the record format of bam, sam and cram files is the same, the fusion gene detection method, system and medium are applicable to bam files and equally to sam and cram files.

As shown in FIGS. 1 and 2, the method for detecting a fusion gene according to the present embodiment includes the steps of:

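1) determining an input file, wherein the input file is a bam file, a sam file or a cram file;

2) filtering each sequencing read of the input file, then acquiring breakpoint information and breakpoint relations and performing frequency statistics, wherein the breakpoint information comprises the breakpoint positions and the fusion kmer sequences at the breakpoints, and the breakpoint relation comprises the size of the overlap or spacing between paired breakpoints;

3) for each breakpoint found, determining the fusion kmer sequence and the reference kmer sequence of the breakpoint, judging whether each sequencing read covering the breakpoint is a fusion read or a reference read, and counting them;

4) judging the fusion type and the genes of each pair of paired breakpoints, and outputting the fusion gene detection result for each pair.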

In general, preliminary filtering is a routine step in fusion gene detection, and those skilled in the art will generally know that it is required. In this embodiment, the filtering in step 2) includes preliminary filtering, which means, for each sequencing read or pair of sequencing reads: filtering out alignment records whose cigar field contains no soft clip (that is, records in which the whole read is completely aligned), filtering out records whose mapping quality is below a set threshold (set to 20 in this embodiment), and filtering out duplicate or supplementary alignment records, wherein the soft clip, mapping quality, and duplicate or supplementary status all come from the information in the input file (the header of the bam file).
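The preliminary filter can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Alignment` record and the function names are hypothetical and not part of the embodiment, while the duplicate and supplementary flag bits (0x400 and 0x800) and the CIGAR operation letters follow the SAM specification.

```python
import re
from dataclasses import dataclass

# Flag bits defined by the SAM specification.
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800

MIN_MAPQ = 20  # mapping-quality threshold used in this embodiment


@dataclass
class Alignment:
    """Hypothetical minimal view of one alignment record."""
    name: str
    flag: int
    mapq: int
    cigar: str


def has_softclip(cigar: str) -> bool:
    """True when the CIGAR string contains at least one soft-clip (S) operation."""
    return any(op == "S" for _, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar))


def passes_preliminary_filter(aln: Alignment, min_mapq: int = MIN_MAPQ) -> bool:
    """Keep soft-clipped, confidently mapped, non-duplicate, non-supplementary records."""
    if not has_softclip(aln.cigar):
        return False  # whole read aligned: no breakpoint evidence
    if aln.mapq < min_mapq:
        return False  # mapping quality below the set threshold
    if aln.flag & (FLAG_DUPLICATE | FLAG_SUPPLEMENTARY):
        return False  # duplicate or supplementary alignment record
    return True


records = [
    Alignment("r1", 99, 60, "100M"),                        # fully aligned
    Alignment("r2", 99, 60, "60M40S"),                      # kept
    Alignment("r3", 99, 10, "60M40S"),                      # low mapping quality
    Alignment("r4", 99 | FLAG_DUPLICATE, 60, "60M40S"),     # duplicate
    Alignment("r5", 99 | FLAG_SUPPLEMENTARY, 60, "60M40S"), # supplementary
]
kept = [a.name for a in records if passes_preliminary_filter(a)]
print(kept)
```

Only `r2` survives in this toy input; every other record fails exactly one of the three criteria.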

As noted in the Background, split reads that do not arise from gene fusion can make up more than 90% of the split reads in a bam file, and analyzing them one by one consumes a large amount of computing resources and run time, which is why most existing fusion gene detection software runs too slowly. To reduce the amount of data processed and improve processing efficiency, the filtering in step 2) of this embodiment includes further filtering, which means: for each sequencing read or pair of sequencing reads, preliminarily judging the cause of the soft clip in the cigar field, and if the cause is read-through sequencing (sequencing through the insert) or an end sequencing error, filtering out and deleting that sequencing read or pair of sequencing reads.

In a bam alignment file, a clip describes the bases at either end of a read that are not aligned, and clips are divided into soft clips and hard clips. A soft clip refers to bases that are not aligned to the reference genome but are still present in the SEQ (segment sequence) field of the record, that is, they are not truncated and discarded; the CIGAR column then carries the symbol S (Soft). A hard clip refers to bases that are not aligned and are not present in the bam record, that is, they are truncated; the CIGAR column carries the symbol H (Hard), but the sequence column holds no corresponding bases.

By further filtering based on the type of soft clip, the method can avoid more than 90% of the data processing and computation that is irrelevant to fusion gene detection, which greatly reduces the amount of data processed and improves the efficiency of fusion gene detection.

In this embodiment, the cause of the soft clip is read-through when the following three conditions are satisfied simultaneously: (i) both ends of the read pair align to the same chromosome and the insert size equals the matched length; (ii) the read matches at its head and is clipped at its tail end; (iii) the head of the other sequencing read in the pair also matches.

In this embodiment, the cause of the soft clip is an end sequencing error when the following three conditions are satisfied simultaneously: (i) both ends of the read pair align to the same chromosome and the insert size is smaller than a set value (1000 bp in this embodiment); (ii) the read matches at its head and does not match at its tail; (iii) the average quality value of the unmatched part is lower than a set value.
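The two soft-clip causes described above (sequencing through the insert, and end sequencing error) can be expressed as a small decision function. The sketch below is an assumption-laden illustration rather than the embodiment's code: the `ClippedRead` fields are hypothetical pre-computed summaries, condition (ii) of the first cause is read as "aligned at the head, clipped at the tail", and the quality cut-off `MIN_CLIP_QUALITY` is an invented placeholder because the text only speaks of "a set value".

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

MAX_INSERT_SIZE = 1000   # bp, the value given for this embodiment
MIN_CLIP_QUALITY = 20.0  # hypothetical; the text only says "a set value"


@dataclass
class ClippedRead:
    """Hypothetical summary of one soft-clipped read and its mate."""
    same_chrom: bool         # both mates align to the same chromosome
    insert_size: int         # absolute insert (template) size in bp
    matched_len: int         # number of aligned bases in this read
    head_matched: bool       # read is aligned at its head and clipped at its tail
    mate_head_matched: bool  # the mate is also aligned at its head
    clip_quals: List[int]    # base qualities of the clipped (unmatched) part


def softclip_cause(read: ClippedRead) -> str:
    """Return 'read_through', 'end_error', or 'candidate' (kept for analysis)."""
    if not (read.same_chrom and read.head_matched):
        return "candidate"
    # Cause 1: the insert is no longer than the aligned part of the read.
    if read.insert_size == read.matched_len and read.mate_head_matched:
        return "read_through"
    # Cause 2: short insert and a low-quality clipped tail.
    if (read.insert_size < MAX_INSERT_SIZE and read.clip_quals
            and mean(read.clip_quals) < MIN_CLIP_QUALITY):
        return "end_error"
    return "candidate"


reads = [
    ClippedRead(True, 80, 80, True, True, [30] * 20),    # read-through
    ClippedRead(True, 300, 90, True, True, [5] * 10),    # low-quality tail
    ClippedRead(False, 0, 60, True, True, [35] * 40),    # mates on different chromosomes
    ClippedRead(True, 300, 60, True, False, [35] * 40),  # high-quality clip
]
causes = [softclip_cause(r) for r in reads]
print(causes)
```

Reads labelled `candidate` are the ones that would go on to breakpoint extraction; the other two labels correspond to reads that the further filtering deletes.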

In this embodiment, the detailed steps of obtaining the breakpoint information and the breakpoint relation in step 2) include: acquiring the position of each breakpoint, the alignment direction of the breakpoint and the fusion kmer sequence at the breakpoint; counting the frequency of the fusion kmer sequences at each breakpoint; obtaining paired breakpoints from all alignment regions of each sequencing read; and determining the size of the overlap or spacing between the paired breakpoints as the breakpoint relation. A kmer sequence is a fixed-length sequence, for example a 12mer sequence when k = 12, which represents a fixed 12 bp sequence; this embodiment uses 12mer sequences.

Breakpoint: when one segment of a sequencing read aligns to a certain position of the reference genome and the adjacent segment does not align to that position (it may align to another position, or to no position in the genome), the position on the reference genome corresponding to the junction of the two segments (which can be understood as the break in the sequencing read) is called a breakpoint. If multiple segments of a sequencing read align to multiple positions of the reference genome, the read has multiple breakpoints on the reference genome, and two breakpoints that are adjacent on the read (that is, breakpoints that are connected to each other) are called paired breakpoints. FIG. 3 shows an example in which breakpoint 1 and breakpoint 2 are paired breakpoints, where ref1 and ref2 represent reference gene sequences, Readk represents the k-th sequencing read, and the alignment directions of breakpoint 1 and breakpoint 2 are opposite. FIG. 4 shows an example of overlap between breakpoint 1 and breakpoint 2, where ref1 and ref2 represent reference gene sequences, Readk represents the k-th sequencing read, the alignment directions of breakpoint 1 and breakpoint 2 are opposite, and overlap is the overlapping part of breakpoint 1 and breakpoint 2.
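One possible way to derive paired breakpoints and their overlap or spacing from the alignment regions of a single read is sketched below. It assumes that each alignment region is described by the interval of read bases it covers, and it uses the sign convention the description gives for the paired-breakpoint distance (negative for overlap, positive for spacing). The `Segment` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """Hypothetical alignment region of one read."""
    read_start: int  # first read base covered (0-based, inclusive)
    read_end: int    # one past the last read base covered (exclusive)
    chrom: str       # chromosome the region aligns to


def paired_breakpoint_distances(segments: List[Segment]) -> List[Tuple[Segment, Segment, int]]:
    """Pair regions that are adjacent on the read and measure their relation.

    The distance is negative when the two regions share read bases (overlap)
    and positive when unaligned bases lie between them (spacing).
    """
    ordered = sorted(segments, key=lambda s: s.read_start)
    return [(a, b, b.read_start - a.read_end) for a, b in zip(ordered, ordered[1:])]


# A 100 bp read whose two alignment regions share 5 read bases.
overlapping = [Segment(55, 100, "chr5"), Segment(0, 60, "chr2")]
# A 100 bp read with 3 unaligned bases between its two regions.
spaced = [Segment(0, 40, "chr2"), Segment(43, 100, "chr2")]

overlap_distance = paired_breakpoint_distances(overlapping)[0][2]
spacing_distance = paired_breakpoint_distances(spaced)[0][2]
print(overlap_distance, spacing_distance)
```

Measuring the relation in read coordinates keeps the calculation independent of alignment direction, which matters because the paired breakpoints in FIG. 3 and FIG. 4 have opposite directions.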

Self kmer sequence and fusion kmer sequence: when one segment of a sequencing read aligns to a certain position of the reference genome (and another segment does not align to that position), the 12 bp of the read before the breakpoint position are called the self kmer sequence, and the 12 bp after the breakpoint position are called the fusion kmer sequence.
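As a minimal sketch of this definition, the function below extracts the breakpoint position, the self kmer and the fusion kmer from one read that is aligned at its head and soft-clipped at its tail. Restricting the sketch to tail clips, using a 1-based leftmost alignment position, and the function names are all assumptions made for illustration.

```python
import re

K = 12  # kmer length used in this embodiment


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def breakpoint_kmers(pos, cigar, seq, k=K):
    """Return (breakpoint, self_kmer, fusion_kmer) for a tail-clipped read.

    pos is the 1-based leftmost reference position of the alignment.
    Returns None when the read has no tail soft clip of at least k bases
    or fewer than k read bases precede the clip.
    """
    ops = parse_cigar(cigar)
    if not ops or ops[-1][1] != "S" or ops[-1][0] < k:
        return None
    clip_len = ops[-1][0]
    # Operations that consume reference bases.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    split = len(seq) - clip_len  # read index where the clipped part begins
    if split < k:
        return None
    breakpoint_pos = pos + ref_len - 1  # last aligned reference base
    return breakpoint_pos, seq[split - k:split], seq[split:split + k]


# 30 aligned bases followed by 20 soft-clipped bases.
read_seq = "A" * 18 + "ACGTACGTACGT" + "TTTTGGGGCCCC" + "A" * 8
result = breakpoint_kmers(1000, "30M20S", read_seq)
print(result)
```

For a read clipped at its head, the same idea would apply mirrored, with the fusion kmer taken from the clipped bases just before the first aligned base.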

The detailed step of determining the fusion kmer sequence of the breakpoint in step 3) includes: according to the frequencies of the fusion kmer sequences obtained by the frequency statistics in step 2), determining the 1 or 2 fusion kmer sequences with significantly high frequency as the final fusion kmer sequences. The detailed step of determining the reference kmer sequence of the breakpoint in step 3) includes: first obtaining all sequencing reads that cover the breakpoint and whose cigar field contains no soft clip (that is, reads that align completely to the genome), then obtaining the reference kmer sequence of each such read at the breakpoint, carrying out frequency statistics, and determining the 1 or 2 reference kmer sequences with significantly high frequency as the final reference kmer sequences.

In this embodiment, the 1 or 2 fusion kmer sequences with significantly high frequency and the 1 or 2 reference kmer sequences with significantly high frequency are determined as follows: the kmer sequences (fusion kmer sequences or reference kmer sequences) are sorted by frequency from high to low to obtain (d_1, d_2, ..., d_i, d_(i+1), ..., d_n), where i = 1 ... n-1; if (d_i - d_(i+1)) / d_i is greater than a preset threshold (0.65 in this embodiment), the first i kmers are all judged to be kmers with significantly high depth, that is, the final kmer sequences. Because the genome is diploid, at most 2 final kmer sequences can exist at the same time; therefore, if i <= 2 the breakpoint has i final kmer sequences, and if i > 2 the breakpoint is considered problematic and is filtered out.
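The selection rule can be illustrated as follows. The sketch takes the first rank i at which the relative drop exceeds the threshold; how the embodiment handles a breakpoint where no such drop exists is not stated, so the fallback here (accept the list if it has at most two entries, otherwise filter) is an assumption, as are the function name and the toy kmer labels.

```python
THRESHOLD = 0.65  # relative-drop threshold used in this embodiment
MAX_FINAL = 2     # a diploid genome allows at most two final kmers


def significant_kmers(counts, threshold=THRESHOLD, max_final=MAX_FINAL):
    """Return the final kmers of a breakpoint, or [] if it is filtered out.

    counts maps each kmer to its frequency (positive integers).
    """
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    for i in range(1, len(ranked)):
        d_i, d_next = ranked[i - 1][1], ranked[i][1]
        if (d_i - d_next) / d_i > threshold:
            # The first i kmers stand out; keep them only if i <= 2.
            return [k for k, _ in ranked[:i]] if i <= max_final else []
    # Assumption: no significant drop was found at any rank.
    return [k for k, _ in ranked] if len(ranked) <= max_final else []


two_alleles = significant_kmers({"AAA": 50, "CCC": 45, "GGG": 3, "TTT": 1})
one_allele = significant_kmers({"AAA": 40, "CCC": 2})
filtered = significant_kmers({"AAA": 30, "CCC": 28, "GGG": 27, "TTT": 2})
print(two_alleles, one_allele, filtered)
```

In the first example the drop from 45 to 3 is (45 - 3) / 45, about 0.93, so i = 2 and both leading kmers are kept; in the third example the first significant drop only appears at i = 3, so the breakpoint is filtered.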

In this embodiment, the detailed steps in step 3) of judging whether a sequencing read covering the breakpoint is a fusion read or a reference read and counting them include: first, initializing the fusion read count and the reference read count of the breakpoint; then traversing all sequencing reads covering the breakpoint and, for each read, calculating the similarity between the read's sequence at the breakpoint and the fusion kmer sequence and the reference kmer sequence respectively; if the read is more similar to the fusion kmer sequence, judging it to be a fusion read and increasing the fusion read count by 1, and if it is more similar to the reference kmer sequence, judging it to be a reference read and increasing the reference read count by 1; and finally, after the traversal, obtaining the fusion read count and the reference read count of the breakpoint.
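The counting loop can be sketched as below. The description does not define the similarity measure, so this sketch assumes the fraction of identical positions between two equal-length kmers; it also assumes that a read equally similar to both kmers is left uncounted. The function names and example sequences are illustrative only.

```python
def similarity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def count_reads(read_kmers, fusion_kmers, reference_kmers):
    """Count fusion and reference reads at one breakpoint.

    read_kmers holds, for every read covering the breakpoint, the read's
    sequence at the breakpoint; the other two arguments hold the 1 or 2
    final fusion and reference kmers of that breakpoint.
    """
    fusion_count = 0
    reference_count = 0
    for kmer in read_kmers:
        fusion_sim = max((similarity(kmer, f) for f in fusion_kmers), default=0.0)
        reference_sim = max((similarity(kmer, r) for r in reference_kmers), default=0.0)
        if fusion_sim > reference_sim:
            fusion_count += 1
        elif reference_sim > fusion_sim:
            reference_count += 1
        # Assumption: ties are not counted for either class.
    return fusion_count, reference_count


reads_at_breakpoint = [
    "TTTTGGGGCCCC",  # identical to the fusion kmer
    "TTTTGGGGCCCA",  # fusion kmer with one sequencing error
    "ACGTACGTACGT",  # identical to the reference kmer
    "ACGTACGTACGA",  # reference kmer with one sequencing error
]
fusion_depth, reference_depth = count_reads(
    reads_at_breakpoint, ["TTTTGGGGCCCC"], ["ACGTACGTACGT"]
)
total_depth = fusion_depth + reference_depth
print(fusion_depth, reference_depth, total_depth)
```

Comparing against both kmers, rather than requiring an exact match, lets reads with an isolated sequencing error still contribute to the depth at the breakpoint.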

In this embodiment, in step 4), the fusion type of each pair of paired breakpoints is judged to be one of four types, namely deletion, duplication, inversion and translocation, and the fusion gene detection result of each pair is output. The result includes the positions of the paired breakpoints, the genes in which they lie, the fusion type, the fusion depth, the total depth, the self kmer sequence of each breakpoint, the fusion kmer sequence of each breakpoint, and the distance between the paired breakpoints, where the fusion depth is the fusion read count of the breakpoint, the total depth is the sum of the fusion read count and the reference read count of the breakpoint, and the distance between the paired breakpoints is the size of the overlap or spacing between them (a value less than 0 indicates overlap and a value greater than 0 indicates spacing).

The fusion types of paired breakpoints in this embodiment comprise only four types: deletion, duplication, inversion and translocation.

A. Deletion

When the two paired breakpoints have the same chromosome number and the same alignment direction, and the position of breakpoint 1 is smaller than that of breakpoint 2, the fusion type is deletion.

B. Duplication

When the two paired breakpoints have the same chromosome number and the same alignment direction, and the position of breakpoint 1 is larger than that of breakpoint 2, the fusion type is duplication.

C. Inversion

When the two paired breakpoints have the same chromosome number and opposite alignment directions, the fusion type is inversion.

D. Translocation (including E and F in the schematic)

When the two paired breakpoints have different chromosome numbers, the fusion type is translocation.
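The four rules above map directly onto a small classification function. The `Breakpoint` type, its field names and the `undetermined` result for two breakpoints at the same position and direction (a case the rules do not cover) are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Breakpoint:
    """Hypothetical breakpoint record."""
    chrom: str   # chromosome number or name
    pos: int     # position on the reference genome
    strand: str  # alignment direction, '+' or '-'


def fusion_type(bp1: Breakpoint, bp2: Breakpoint) -> str:
    """Classify a pair of paired breakpoints into one of the four fusion types."""
    if bp1.chrom != bp2.chrom:
        return "translocation"
    if bp1.strand != bp2.strand:
        return "inversion"
    if bp1.pos < bp2.pos:
        return "deletion"
    if bp1.pos > bp2.pos:
        return "duplication"
    return "undetermined"  # same position and direction: not covered by the rules


examples = [
    fusion_type(Breakpoint("chr2", 1000, "+"), Breakpoint("chr2", 5000, "+")),
    fusion_type(Breakpoint("chr2", 5000, "+"), Breakpoint("chr2", 1000, "+")),
    fusion_type(Breakpoint("chr2", 1000, "+"), Breakpoint("chr2", 5000, "-")),
    fusion_type(Breakpoint("chr2", 1000, "+"), Breakpoint("chr5", 5000, "-")),
]
print(examples)
```

The chromosome check comes first because, for breakpoints on different chromosomes, neither the direction nor the position comparison is meaningful.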

In addition, the present embodiment further provides a fusion gene detection system comprising a computer device, wherein the computer device is programmed or configured to execute the steps of the fusion gene detection method of the present embodiment, or a storage medium of the computer device stores a computer program programmed or configured to execute the fusion gene detection method of the present embodiment.

In addition, the present embodiment also provides a computer-readable storage medium having stored thereon a computer program programmed or configured to execute the aforementioned fusion gene detection method of the present embodiment.

The above description is only a preferred embodiment of the present invention; the protection scope of the present invention is not limited to the above embodiment, and all technical solutions within the idea of the present invention fall within its protection scope. It should be noted that modifications and refinements made by those skilled in the art without departing from the principle of the invention are also considered to be within the protection scope of the invention.

Claims

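1. A method for detecting a fusion gene, comprising the steps of:

1) determining an input file, wherein the input file is a bam file, a sam file or a cram file;

2) filtering each sequencing read of the input file, then acquiring breakpoint information and breakpoint relations and performing frequency statistics, wherein the breakpoint information comprises the breakpoint positions and the fusion kmer sequences at the breakpoints, and the breakpoint relation comprises the size of the overlap or spacing between paired breakpoints;

3) for each breakpoint found, determining the fusion kmer sequence and the reference kmer sequence of the breakpoint, judging whether each sequencing read covering the breakpoint is a fusion read or a reference read, and counting them;

4) judging the fusion type and the genes of each pair of paired breakpoints, and outputting the fusion gene detection result for each pair.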

2. The method for detecting a fusion gene according to claim 1, wherein the filtering in step 2) comprises preliminary filtering, which means, for each sequencing read or pair of sequencing reads: filtering out alignment records whose cigar field contains no soft clip (softclip), filtering out records whose mapping quality is below a set threshold, and filtering out duplicate or supplementary alignment records, wherein the soft clip, mapping quality, and duplicate or supplementary status are all taken from the information in the input file.

3. The method for detecting a fusion gene according to claim 1, wherein the filtering in step 2) comprises further filtering, which means: for each sequencing read or pair of sequencing reads, preliminarily judging the cause of the soft clip in the cigar field, and if the cause is read-through sequencing (sequencing through the insert) or an end sequencing error, filtering out and deleting that sequencing read or pair of sequencing reads.

4. The method for detecting a fusion gene according to claim 3, wherein the cause of the soft clip is read-through when the following three conditions are satisfied simultaneously: (i) both ends of the read pair align to the same chromosome and the insert size equals the matched length; (ii) the read matches at its head and is clipped at its tail end; (iii) the head of the other sequencing read in the pair also matches; and the cause of the soft clip is an end sequencing error when the following three conditions are satisfied simultaneously: (i) both ends of the read pair align to the same chromosome and the insert size is smaller than a set value; (ii) the read matches at its head and does not match at its tail; (iii) the average quality value of the unmatched part is lower than a set value.

5. The method for detecting a fusion gene according to claim 1, wherein the detailed steps of obtaining the breakpoint information and the breakpoint relation in step 2) comprise: acquiring the position of each breakpoint, the alignment direction of the breakpoint and the fusion kmer sequence at the breakpoint; counting the frequency of the fusion kmer sequences at each breakpoint; obtaining paired breakpoints from all alignment regions of each sequencing read; and determining the size of the overlap or spacing between the paired breakpoints as the breakpoint relation.

6. The method for detecting a fusion gene according to claim 1, wherein the detailed step of determining the fusion kmer sequence of the breakpoint in step 3) comprises: according to the frequencies of the fusion kmer sequences obtained by the frequency statistics in step 2), determining the 1 or 2 fusion kmer sequences with significantly high frequency as the final fusion kmer sequences; and the detailed step of determining the reference kmer sequence of the breakpoint in step 3) comprises: first obtaining all sequencing reads that cover the breakpoint and whose cigar field contains no soft clip, then obtaining the reference kmer sequence of each such read at the breakpoint, carrying out frequency statistics, and determining the 1 or 2 reference kmer sequences with significantly high frequency as the final reference kmer sequences.

7. The method for detecting a fusion gene according to claim 1, wherein the detailed steps in step 3) of judging whether a sequencing read covering the breakpoint is a fusion read or a reference read and counting them comprise: first, initializing the fusion read count and the reference read count of the breakpoint; then traversing all sequencing reads covering the breakpoint and, for each read, calculating the similarity between the read's sequence at the breakpoint and the fusion kmer sequence and the reference kmer sequence respectively; if the read is more similar to the fusion kmer sequence, judging it to be a fusion read and increasing the fusion read count by 1, and if it is more similar to the reference kmer sequence, judging it to be a reference read and increasing the reference read count by 1; and finally, after the traversal, obtaining the fusion read count and the reference read count of the breakpoint.

8. The method for detecting a fusion gene according to claim 1, wherein in step 4) the fusion type of each pair of paired breakpoints is judged to be one of four types, namely deletion, duplication, inversion and translocation, and the fusion gene detection result of each pair is output, the result including the positions of the paired breakpoints, the genes in which they lie, the fusion type, the fusion depth, the total depth, the self kmer sequence of each breakpoint, the fusion kmer sequence of each breakpoint, and the distance between the paired breakpoints, wherein the fusion depth is the fusion read count of the breakpoint, the total depth is the sum of the fusion read count and the reference read count of the breakpoint, and the distance between the paired breakpoints is the size of the overlap or spacing between them.

9. A fusion gene detection system comprising a computer device, wherein the computer device is programmed or configured to execute the steps of the method for detecting a fusion gene according to any one of claims 1 to 8, or a storage medium of the computer device stores a computer program programmed or configured to execute the method for detecting a fusion gene according to any one of claims 1 to 8.

10. A computer-readable storage medium having stored thereon a computer program programmed or configured to perform the method for detecting a fusion gene according to any one of claims 1 to 8.