CN111081318A - Fusion gene detection method, system and medium - Google Patents

Fusion gene detection method, system and medium

Info

Publication number
CN111081318A
Authority
CN
China
Prior art keywords
breakpoint
fusion
read
sequencing
kmer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911243763.8A
Other languages
Chinese (zh)
Other versions
CN111081318B (en)
Inventor
曾华萍
传军
吴桂枝
王益民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Genetalks Bio Tech Changsha Co ltd
Original Assignee
Genetalks Bio Tech Changsha Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Genetalks Bio Tech Changsha Co ltd
Priority to CN201911243763.8A
Publication of CN111081318A
Application granted
Publication of CN111081318B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a fusion gene detection method, system and medium. The method comprises: determining an input file; filtering each sequencing read of the input file, then acquiring breakpoint information and breakpoint relations and performing frequency statistics; for each breakpoint found, determining the fusion kmer sequence and the reference kmer sequence of the breakpoint, judging whether each sequencing read covering the breakpoint is a fusion read or a reference read, and counting; and judging the fusion type and the genes involved for each pair of paired breakpoints, and outputting the fusion gene detection result for each pair. The invention greatly reduces the amount of data to be processed and improves the efficiency of fusion gene detection. It takes overlap between breakpoints fully into account, so the fusion mode is determined more definitely, and by precisely analyzing how split reads align it achieves accurate fusion gene detection, with the advantages of high speed, accurate frequency estimation and high sensitivity.

Description

Fusion gene detection method, system and medium
Technical Field
The invention relates to a gene detection technology, in particular to a fusion gene detection method, a fusion gene detection system and a fusion gene detection medium.
Background
Gene fusion is an important type of genomic structural variation. Second-generation sequencing, including WGS (whole genome sequencing), WES (whole exome sequencing) and targeted region capture, is widely used to detect fusion genes.
Fusion genes are usually detected from discordant reads and split reads. Split reads are the more important of the two, because they determine the location of a breakpoint as well as its depth and frequency. Most fusion detection software therefore bases its judgment on split reads. In practice, however, most current fusion detection software has the following problems:
1. A large share of the split reads in a bam file, in some cases more than 90%, are not caused by gene fusion. Analyzing every split read for fusion one by one consumes a large amount of computing resources and running time, so most existing fusion gene detection software runs too slowly.
2. Existing fusion gene detection methods do not consider the overlap relation between breakpoints, so the fusion mode they detect is ambiguous.
Moreover, existing fusion gene detection methods commonly suffer in practice from inaccurate detection, slow speed, and even failure to detect fusions.
Disclosure of Invention
The technical problem to be solved by the invention is as follows. In view of the problems in the prior art, the invention provides a fusion gene detection method, system and medium. By further filtering reads according to the cause of their soft clip (softclip), the method can eliminate more than 90% of the data processing and computation that is irrelevant to fusion gene detection, which greatly reduces the amount of data processed and improves the efficiency of fusion gene detection. The method also specifies the size of the overlap or gap between paired breakpoints, takes the specific mode of gene fusion fully into account, and so determines the fusion mode more definitely.
In order to solve the technical problems, the invention adopts the technical scheme that:
A fusion gene detection method comprises the following implementation steps:
1) determining an input file, wherein the input file is a bam file, a sam file or a cram file;
2) filtering each sequencing read of the input file, then acquiring breakpoint information and breakpoint relations and performing frequency statistics, wherein the breakpoint information comprises the breakpoint positions and the fusion kmer sequences at the breakpoints, and the breakpoint relation comprises the size of the overlap or gap between paired breakpoints;
3) for each breakpoint found, determining the fusion kmer sequence and the reference kmer sequence of the breakpoint, judging whether each sequencing read covering the breakpoint is a fusion read or a reference read, and counting;
4) judging the fusion type and the genes involved for each pair of paired breakpoints, and outputting the fusion gene detection result for each pair of paired breakpoints.
Optionally, the filtering in step 2) includes preliminary filtering, which means, for each sequencing read or pair of sequencing reads: filtering out alignment records whose CIGAR field contains no soft clip (softclip), filtering out records whose alignment quality is below a set threshold, and filtering out duplicate or supplementary alignment records, wherein the soft clip, the alignment quality, and the duplicate or supplementary status are all taken from the information in the input file.
Optionally, the filtering in step 2) includes further filtering, which means: for each sequencing read or pair of sequencing reads, preliminarily judging the cause of the soft clip in the CIGAR field, and if the cause is sequencing read-through or an end sequencing error, filtering out and deleting that sequencing read or pair of sequencing reads.
Optionally, the cause of the soft clip is judged to be sequencing read-through when the following three conditions are satisfied simultaneously: (i) both ends of the paired sequencing read align to the same chromosome and the insert size equals the matched length; (ii) the head of the read matches and the tail end does not; (iii) the beginning of the other sequencing read in the pair also matches. The cause of the soft clip is judged to be an end sequencing error when the following three conditions are satisfied simultaneously: (i) both ends of the paired sequencing read align to the same chromosome and the insert size is smaller than a set value; (ii) the sequencing read matches at the beginning and is unmatched at the end; (iii) the average quality value of the unmatched part is lower than a set value.
Optionally, the detailed step of obtaining the breakpoint information and the breakpoint relation in step 2) includes: acquiring the position of each breakpoint, the alignment direction at the breakpoint and the fusion kmer sequence at the breakpoint; counting the frequency of the fusion kmer sequences at each breakpoint; obtaining paired breakpoints from all aligned regions of each sequencing read; and determining the size of the overlap or gap between the paired breakpoints as the breakpoint relation.
Optionally, the detailed step of determining the fusion kmer sequence of the breakpoint in step 3) includes: according to the frequencies of the fusion kmer sequences obtained by the frequency statistics in step 2), determining the 1 or 2 fusion kmer sequences with significantly high frequency as the final fusion kmer sequences. The detailed step of determining the reference kmer sequence of the breakpoint in step 3) includes: first obtaining all sequencing reads that cover the breakpoint and whose CIGAR field contains no soft clip, then obtaining the reference kmer sequence of each such sequencing read at the breakpoint, performing frequency statistics, and determining the 1 or 2 reference kmer sequences with significantly high frequency as the final reference kmer sequences.
Optionally, the detailed step in step 3) of judging whether each sequencing read covering the breakpoint is a fusion read or a reference read and counting includes: first, initializing the fusion read count and the reference read count of the breakpoint; then traversing all sequencing reads of the breakpoint and, for each read, calculating the similarity of the read's sequence at the breakpoint to the fusion kmer sequence and to the reference kmer sequence respectively; if the similarity to the fusion kmer sequence is higher, the sequencing read is judged to be a fusion read and the fusion read count is increased by 1, and if the similarity to the reference kmer sequence is higher, the sequencing read is judged to be a reference read and the reference read count is increased by 1; and finally, after the traversal, obtaining the fusion read count and the reference read count of the breakpoint.
Optionally, in step 4), the fusion type of each pair of paired breakpoints is judged to be one of four types, namely deletion, duplication, inversion and translocation, and the fusion gene detection result of each pair of paired breakpoints is output. The fusion gene detection result includes the positions of the paired breakpoints, the genes in which the paired breakpoints are located, the fusion type, the fusion depth, the total depth, the kmer sequence of the breakpoint, the fusion kmer sequence of the breakpoint, and the distance between the paired breakpoints, where the fusion depth is the fusion read count of the breakpoint, the total depth is the sum of the fusion read count and the reference read count of the breakpoint, and the distance between the paired breakpoints is the size of the overlap or gap between them.
In addition, the present invention also provides a fused gene detecting system comprising a computer device programmed or configured to execute the steps of the fused gene detecting method, or a storage medium of the computer device having stored thereon a computer program programmed or configured to execute the fused gene detecting method.
In addition, the present invention also provides a computer-readable storage medium having stored thereon a computer program programmed or configured to execute the fusion gene detection method.
Compared with the prior art, the invention has the following advantages:
1. The invention specifies the size of the overlap or gap between paired breakpoints and takes the specific mode of gene fusion fully into account, so the fusion mode is determined more definitely.
2. The method determines the fusion kmer sequence and the reference kmer sequence of each breakpoint, and judges whether each sequencing read covering the breakpoint is a fusion read or a reference read according to its similarity to the fusion kmer sequence and to the reference kmer sequence, thereby achieving accurate detection of gene fusion with the advantages of high speed, accurate frequency estimation and high sensitivity.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
FIG. 2 is a detailed flow chart of the method according to the embodiment of the present invention.
FIG. 3 is a schematic diagram of a pair of paired breakpoints (breakpoint 1 and breakpoint 2) in the embodiment of the present invention.
FIG. 4 is a schematic diagram of the overlap between paired breakpoints in the embodiment of the present invention.
Detailed Description
The fusion gene detection method, system and medium of the present invention are described in further detail below, taking a bam file as an example. The bam file is the most common storage format for alignment data in current genetic data analysis. It is suitable for both short and long sequencing reads and supports sequencing reads of up to 128 Mbp. Besides bam files (file suffix .bam), cram files (file suffix .cram) are a highly compressed form of bam files, with slightly worse IO efficiency than the original bam file, and sam files (file suffix .sam) are the plain-text form of bam files. Because bam, sam and cram files store the same alignment records, the fusion gene detection method, system and medium are applicable to sam and cram files as well as bam files.
As shown in FIGS. 1 and 2, the method for detecting a fusion gene according to the present embodiment includes the steps of:
1) determining an input file, wherein the input file is a bam file, a sam file or a cram file;
2) filtering each sequencing read of an input file, obtaining breakpoint information and breakpoint relation, and performing frequency statistics, wherein the breakpoint information comprises breakpoint positions and fusion kmer sequences at the breakpoints, and the breakpoint relation comprises the overlapping or spacing size of paired breakpoints;
3) for each breakpoint found, determining the fusion kmer sequence and the reference kmer sequence of the breakpoint, judging whether each sequencing read covering the breakpoint is a fusion read or a reference read, and counting;
4) judging the fusion type and the genes involved for each pair of paired breakpoints, and outputting the fusion gene detection result for each pair of paired breakpoints.
Preliminary filtering is a routine step in fusion gene detection, and those skilled in the art will generally know that it is required. In this embodiment, the filtering in step 2) includes preliminary filtering, which means, for each sequencing read or pair of sequencing reads: filtering out alignment records whose CIGAR field contains no soft clip (softclip), that is, records in which the whole read aligns completely; filtering out records whose alignment quality is below a set threshold (20 in this embodiment); and filtering out duplicate or supplementary alignment records. Whether a record has a soft clip, its alignment quality, and whether it is a duplicate or supplementary alignment are all taken from the input file information (the header of the bam file).
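The preliminary filter described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the `Alignment` class and the function names are hypothetical, the CIGAR string is parsed directly rather than through an alignment library, and the duplicate and supplementary bits are the standard SAM flag values 0x400 and 0x800.

```python
import re
from dataclasses import dataclass

MIN_MAPQ = 20                # alignment-quality threshold used in this embodiment
FLAG_DUPLICATE = 0x400       # SAM flag bit: PCR or optical duplicate
FLAG_SUPPLEMENTARY = 0x800   # SAM flag bit: supplementary alignment


@dataclass
class Alignment:
    """Hypothetical minimal view of one alignment record."""
    cigar: str   # e.g. "60M40S"
    mapq: int    # alignment (mapping) quality
    flag: int    # SAM bitwise flag


def has_softclip(cigar: str) -> bool:
    """Return True if any CIGAR operation is a soft clip (S)."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return any(op == "S" for _, op in ops)


def passes_preliminary_filter(aln: Alignment) -> bool:
    """Keep only soft-clipped, well-aligned, primary, non-duplicate records."""
    if not has_softclip(aln.cigar):
        return False  # whole read aligned completely
    if aln.mapq < MIN_MAPQ:
        return False  # alignment quality below threshold
    if aln.flag & (FLAG_DUPLICATE | FLAG_SUPPLEMENTARY):
        return False  # duplicate or supplementary alignment
    return True
```

For example, `passes_preliminary_filter(Alignment("60M40S", 60, 0))` keeps the record, while a fully aligned `"100M"` record is dropped.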
A large share of the split reads in a bam file, in some cases more than 90%, are not caused by gene fusion, and analyzing every split read for fusion one by one consumes a large amount of computing resources and running time, which is why most existing fusion gene detection software runs too slowly. To reduce the amount of data processed and improve processing efficiency, the filtering in step 2) of this embodiment includes further filtering, which means: for each sequencing read or pair of sequencing reads, preliminarily judging the cause of the soft clip in the CIGAR field, and if the cause is sequencing read-through or an end sequencing error, filtering out and deleting that sequencing read or pair of sequencing reads. In a bam alignment file, a clip describes bases at either end of a read that are not aligned. Clips are divided into soft clips and hard clips. A soft clip (sometimes written softclip) is a sequence that does not align to the reference genome but is still present in the SEQ (segment sequence) field of the read in the bam file, that is, it is not truncated and discarded; the CIGAR column marks it with the symbol S (Soft). A hard clip (hardclip) is a sequence that does not align and is not present in the bam file, that is, it has been truncated; the CIGAR column keeps the symbol H (Hard), but the sequence column has no corresponding bases.
By further filtering according to the cause of the soft clip, the method can eliminate more than 90% of the data processing and computation that is irrelevant to fusion gene detection, which greatly reduces the amount of data processed and improves the efficiency of fusion gene detection.
In this embodiment, the cause of the soft clip is judged to be sequencing read-through when the following three conditions are satisfied simultaneously: (i) both ends of the paired sequencing read align to the same chromosome and the insert size equals the matched length; (ii) the head of the read matches and the tail end does not; (iii) the beginning of the other sequencing read in the pair also matches.
In this embodiment, the cause of the soft clip is judged to be an end sequencing error when the following three conditions are satisfied simultaneously: (i) both ends of the paired sequencing read align to the same chromosome and the insert size is smaller than a set value (1000 bp in this embodiment); (ii) the sequencing read matches at the beginning and is unmatched at the end; (iii) the average quality value of the unmatched part is lower than a set value.
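The two soft clip causes above can be expressed as simple predicates. The sketch below is illustrative only: the `ClippedRead` fields are hypothetical summaries that a caller would have to derive from the alignment records, condition (ii) for read-through is read as "head matches, tail does not", and the quality cutoff of 20 is an assumed value, since the text only says "a set value".

```python
from dataclasses import dataclass, field
from typing import List

MAX_INSERT_SIZE = 1000       # bp, value used in this embodiment
MIN_MEAN_CLIP_QUALITY = 20   # assumed value; the text only says "a set value"


@dataclass
class ClippedRead:
    """Hypothetical per-read summary derived from the alignment records."""
    same_chromosome: bool     # both ends of the pair align to one chromosome
    insert_size: int          # absolute insert size in bp
    matched_length: int       # number of aligned bases in this read
    head_matched: bool        # beginning of the read is aligned
    tail_matched: bool        # end of the read is aligned
    mate_head_matched: bool   # beginning of the other read in the pair is aligned
    clip_qualities: List[int] = field(default_factory=list)  # qualities of clipped bases


def is_read_through(r: ClippedRead) -> bool:
    """Conditions (i)-(iii) for a soft clip caused by sequencing read-through."""
    return (r.same_chromosome and r.insert_size == r.matched_length
            and r.head_matched and not r.tail_matched
            and r.mate_head_matched)


def is_end_sequencing_error(r: ClippedRead) -> bool:
    """Conditions (i)-(iii) for a soft clip caused by an end sequencing error."""
    if not r.clip_qualities:
        return False
    mean_quality = sum(r.clip_qualities) / len(r.clip_qualities)
    return (r.same_chromosome and r.insert_size < MAX_INSERT_SIZE
            and r.head_matched and not r.tail_matched
            and mean_quality < MIN_MEAN_CLIP_QUALITY)


def keep_for_fusion_analysis(r: ClippedRead) -> bool:
    """Further filtering: drop reads whose soft clip has a non-fusion cause."""
    return not (is_read_through(r) or is_end_sequencing_error(r))
```

Only reads for which `keep_for_fusion_analysis` returns `True` would go on to breakpoint extraction.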
In this embodiment, the detailed step of obtaining the breakpoint information and the breakpoint relation in step 2) includes: acquiring the position of each breakpoint, the alignment direction at the breakpoint and the fusion kmer sequence at the breakpoint; counting the frequency of the fusion kmer sequences at each breakpoint; obtaining paired breakpoints from all aligned regions of each sequencing read; and determining the size of the overlap or gap between the paired breakpoints as the breakpoint relation. A kmer sequence is a sequence of fixed length k; for example, when k = 12 it is a 12mer, a fixed sequence of 12 bp. This embodiment uses 12mer sequences.
Breakpoint: when one segment of a sequencing read aligns to a certain position of the reference genome and the adjacent segment does not align to that position (it may align to another position, or to no position in the genome), the point where the two segments join on the read (which can be understood as the breakpoint of the sequencing read) corresponds to a position on the reference genome, and that position is called a breakpoint. If several segments of a sequencing read align to several positions of the reference genome, the read has several breakpoints on the reference genome, and two breakpoints that are adjacent on the read (that is, joined to each other on the read) are called paired breakpoints. FIG. 3 shows an example in which breakpoint 1 and breakpoint 2 are paired breakpoints, where ref1 and ref2 represent reference gene sequences, Readk represents the k-th sequencing read, and the alignment directions at breakpoint 1 and breakpoint 2 are opposite. FIG. 4 shows an example of overlap between breakpoint 1 and breakpoint 2, where ref1 and ref2 represent reference gene sequences, Readk represents the k-th sequencing read, the alignment directions at breakpoint 1 and breakpoint 2 are opposite, and overlap denotes the overlap between breakpoint 1 and breakpoint 2.
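One way to compute the overlap or gap between paired breakpoints is sketched below. It assumes, as an illustration only, that each aligned segment is described by a half-open interval in read coordinates and that the distance is measured along the read; the text does not fix the coordinate system. The function name is hypothetical. The result is negative for an overlap and positive for a gap.

```python
from typing import Tuple

Segment = Tuple[int, int]  # (read_start, read_end), half-open, read coordinates


def paired_breakpoint_distance(seg_a: Segment, seg_b: Segment) -> int:
    """Signed distance between two adjacent aligned segments of one read.

    A negative value means the segments overlap on the read by that many
    bases; a positive value means unaligned bases lie between them.
    """
    first, second = sorted([seg_a, seg_b])
    return second[0] - first[1]
```

For instance, segments `(0, 60)` and `(55, 100)` share five read bases, so the function returns `-5`.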
Self kmer sequence and fusion kmer sequence: when one segment of a sequencing read aligns to a certain position of the reference genome (and another segment does not align to that position), the 12 bp of the read immediately before the breakpoint is called the self kmer sequence, and the 12 bp immediately after the breakpoint is called the fusion kmer sequence.
The detailed step of determining the fusion kmer sequence of the breakpoint in step 3) includes: according to the frequencies of the fusion kmer sequences obtained by the frequency statistics in step 2), determining the 1 or 2 fusion kmer sequences with significantly high frequency as the final fusion kmer sequences. The detailed step of determining the reference kmer sequence of the breakpoint in step 3) includes: first obtaining all sequencing reads that cover the breakpoint and whose CIGAR field contains no soft clip (that is, sequencing reads that align completely to the genome), then obtaining the reference kmer sequence of each such sequencing read at the breakpoint, performing frequency statistics, and determining the 1 or 2 reference kmer sequences with significantly high frequency as the final reference kmer sequences.
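A minimal sketch of extracting the two kmers from a read follows. The function names and the `aligned_side` parameter are hypothetical. The sketch assumes the breakpoint is given as an index into the read sequence and that the self kmer is taken from the aligned side and the fusion kmer from the other side, which matches the definition above when the aligned segment precedes the breakpoint.

```python
from collections import Counter
from typing import Iterable, Tuple

K = 12  # kmer length used in this embodiment


def kmers_at_breakpoint(read_seq: str, break_idx: int,
                        aligned_side: str = "left", k: int = K) -> Tuple[str, str]:
    """Return (self_kmer, fusion_kmer) around read index break_idx."""
    left = read_seq[max(0, break_idx - k):break_idx]
    right = read_seq[break_idx:break_idx + k]
    if aligned_side == "left":
        return left, right
    return right, left


def count_fusion_kmers(reads: Iterable[Tuple[str, int, str]]) -> Counter:
    """Tally full-length fusion kmers over (sequence, break_idx, aligned_side)."""
    counts: Counter = Counter()
    for seq, idx, side in reads:
        _, fusion = kmers_at_breakpoint(seq, idx, side)
        if len(fusion) == K:  # skip kmers truncated by the read end
            counts[fusion] += 1
    return counts
```

The resulting counter is the per-breakpoint frequency table of fusion kmer sequences referred to in step 2).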
In this embodiment, the 1 or 2 fusion kmer sequences, or the 1 or 2 reference kmer sequences, with significantly high frequency are determined as follows. The kmer sequences (fusion kmer sequences or reference kmer sequences) are sorted in descending order of frequency to obtain the frequencies $(d_1, d_2, \ldots, d_i, d_{i+1}, \ldots, d_n)$. If $(d_i - d_{i+1}) / d_i$ exceeds a preset threshold (0.65 in this embodiment), where $i = 1, \ldots, n-1$, the first $i$ kmers are all considered to have significantly high depth and are taken as the final kmer sequences. Because the genome is diploid, at most 2 final kmer sequences can exist at the same time. Therefore, if $i \le 2$, the breakpoint has $i$ final kmer sequences, and if $i > 2$, the breakpoint is considered problematic and is filtered out.
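The selection rule can be sketched as below. Two points are assumptions not stated in the text: the first position at which the relative drop exceeds the threshold is used as the cut, and when no such position exists (for example when only one kmer was observed) all kmers are treated as equally frequent before the "at most 2" rule is applied. The function name is hypothetical.

```python
from typing import Dict, List

GAP_THRESHOLD = 0.65    # preset threshold used in this embodiment
MAX_FINAL_KMERS = 2     # a diploid genome allows at most 2 final kmers


def select_final_kmers(counts: Dict[str, int],
                       threshold: float = GAP_THRESHOLD) -> List[str]:
    """Return the significantly frequent kmers, or [] if the breakpoint is filtered."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return []
    cut = len(ranked)  # assumption: no sharp drop means all kmers rank together
    for i in range(len(ranked) - 1):
        d_i, d_next = ranked[i][1], ranked[i + 1][1]
        if (d_i - d_next) / d_i > threshold:
            cut = i + 1  # the first i kmers (1-based) are significantly high
            break
    if cut > MAX_FINAL_KMERS:
        return []  # more than 2 candidates: breakpoint considered problematic
    return [kmer for kmer, _ in ranked[:cut]]
```

With counts of 50, 45 and 5, the drop from 45 to 5 is about 0.89, so the first two kmers are kept as the final kmer sequences.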
In this embodiment, the detailed step in step 3) of judging whether each sequencing read covering the breakpoint is a fusion read or a reference read and counting includes: first, initializing the fusion read count and the reference read count of the breakpoint; then traversing all sequencing reads of the breakpoint and, for each read, calculating the similarity of the read's sequence at the breakpoint to the fusion kmer sequence and to the reference kmer sequence respectively; if the similarity to the fusion kmer sequence is higher, the sequencing read is judged to be a fusion read and the fusion read count is increased by 1, and if the similarity to the reference kmer sequence is higher, the sequencing read is judged to be a reference read and the reference read count is increased by 1; and finally, after the traversal, obtaining the fusion read count and the reference read count of the breakpoint.
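The counting loop can be sketched as follows. The text does not name a similarity measure, so this sketch assumes the fraction of identical positions between two sequences; it also assumes that a read equally similar to both kmers is counted as neither. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def similarity(a: str, b: str) -> float:
    """Assumed measure: fraction of identical positions over the shared length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def count_reads(read_kmers: Iterable[str],
                fusion_kmers: List[str],
                reference_kmers: List[str]) -> Tuple[int, int]:
    """Return (fusion_read_count, reference_read_count) for one breakpoint."""
    fusion_count = 0
    reference_count = 0
    for seq in read_kmers:
        fus = max((similarity(seq, k) for k in fusion_kmers), default=0.0)
        ref = max((similarity(seq, k) for k in reference_kmers), default=0.0)
        if fus > ref:
            fusion_count += 1
        elif ref > fus:
            reference_count += 1
        # ties are left uncounted (assumption; the text does not cover them)
    return fusion_count, reference_count
```

The first returned value corresponds to the fusion depth of the breakpoint, and the sum of both values to its total depth.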
In this embodiment, in step 4), the fusion type of each pair of paired breakpoints is judged to be one of four types, namely deletion, duplication, inversion and translocation, and the fusion gene detection result of each pair of paired breakpoints is output. The fusion gene detection result includes the positions of the paired breakpoints, the genes in which the paired breakpoints are located, the fusion type, the fusion depth, the total depth, the kmer sequence of the breakpoint, the fusion kmer sequence of the breakpoint, and the distance between the paired breakpoints, where the fusion depth is the fusion read count of the breakpoint, the total depth is the sum of the fusion read count and the reference read count of the breakpoint, and the distance between the paired breakpoints is the size of the overlap or gap between them (a value less than 0 indicates an overlap, and a value greater than 0 indicates a gap).
The fusion types of paired breakpoints in this embodiment comprise only four types: deletion, duplication, inversion and translocation.
A. Deletion
When the two paired breakpoints are on the same chromosome, their alignment directions are the same, and the position of breakpoint 1 is smaller than that of breakpoint 2, the fusion type is deletion.
B. Duplication
When the two paired breakpoints are on the same chromosome, their alignment directions are the same, and the position of breakpoint 1 is larger than that of breakpoint 2, the fusion type is duplication.
C. Inversion
When the two paired breakpoints are on the same chromosome and their alignment directions are opposite, the fusion type is inversion.
D. Translocation (including E and F in the schematic diagram)
When the two paired breakpoints are on different chromosomes, the fusion type is translocation.
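The four rules above translate directly into a small classifier. The `Breakpoint` class and the `"+"`/`"-"` encoding of the alignment direction are hypothetical; the case of equal positions on the same chromosome and direction is not covered by the text, so the sketch returns `None` for it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Breakpoint:
    """Hypothetical description of one breakpoint of a pair."""
    chrom: str    # chromosome name
    pos: int      # position on the reference genome
    strand: str   # alignment direction, "+" or "-"


def fusion_type(bp1: Breakpoint, bp2: Breakpoint) -> Optional[str]:
    """Classify a pair of paired breakpoints by rules A to D."""
    if bp1.chrom != bp2.chrom:
        return "translocation"   # rule D
    if bp1.strand != bp2.strand:
        return "inversion"       # rule C
    if bp1.pos < bp2.pos:
        return "deletion"        # rule A
    if bp1.pos > bp2.pos:
        return "duplication"     # rule B
    return None                  # equal positions: not covered by the text
```

For example, two breakpoints on `chr2` with the same direction and positions 1000 and 5000 are classified as a deletion.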
In addition, the present embodiment further provides a fused gene detecting system, which includes a computer device programmed or configured to execute the steps of the fused gene detecting method of the present embodiment, or a storage medium of the computer device having stored thereon a computer program programmed or configured to execute the fused gene detecting method of the present embodiment.
In addition, the present embodiment also provides a computer-readable storage medium having stored thereon a computer program programmed or configured to execute the aforementioned fusion gene detection method of the present embodiment.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (10)

1. A method for detecting a fusion gene, comprising the steps of:
1) determining an input file, wherein the input file is a bam file, a sam file or a cram file;
2) filtering each sequencing read of the input file, then acquiring breakpoint information and breakpoint relations and performing frequency statistics, wherein the breakpoint information comprises the breakpoint positions and the fusion kmer sequences at the breakpoints, and the breakpoint relation comprises the size of the overlap or gap between paired breakpoints;
3) for each breakpoint found, determining the fusion kmer sequence and the reference kmer sequence of the breakpoint, judging whether each sequencing read covering the breakpoint is a fusion read or a reference read, and counting;
4) judging the fusion type and the genes involved for each pair of paired breakpoints, and outputting the fusion gene detection result for each pair of paired breakpoints.
2. The method for detecting a fusion gene according to claim 1, wherein the filtering in step 2) comprises preliminary filtering, which means, for each sequencing read or pair of sequencing reads: filtering out alignment records whose CIGAR field contains no soft clip (softclip), filtering out records whose alignment quality is below a set threshold, and filtering out duplicate or supplementary alignment records, wherein the soft clip, the alignment quality, and the duplicate or supplementary status are all taken from the information in the input file.
3. The method for detecting a fusion gene according to claim 1, wherein the filtering in step 2) comprises further filtering, which means: for each sequencing read or pair of sequencing reads, preliminarily judging the cause of the soft clip in the CIGAR field, and if the cause is sequencing read-through or an end sequencing error, filtering out and deleting that sequencing read or pair of sequencing reads.
4. The method for detecting a fusion gene according to claim 3, wherein the cause of the soft clip is judged to be sequencing read-through when the following three conditions are satisfied simultaneously: (i) both ends of the paired sequencing read align to the same chromosome and the insert size equals the matched length; (ii) the head of the read matches and the tail end does not; (iii) the beginning of the other sequencing read in the pair also matches; and the cause of the soft clip is judged to be an end sequencing error when the following three conditions are satisfied simultaneously: (i) both ends of the paired sequencing read align to the same chromosome and the insert size is smaller than a set value; (ii) the sequencing read matches at the beginning and is unmatched at the end; (iii) the average quality value of the unmatched part is lower than a set value.
5. The method for detecting a fusion gene according to claim 1, wherein the detailed steps of obtaining breakpoint information and breakpoint relationships in step 2) comprise: obtaining the position of each breakpoint, the alignment direction at the breakpoint, and the fusion kmer sequence at the breakpoint; counting the frequency of the fusion kmer sequences at each breakpoint; obtaining paired breakpoints from all alignment regions of each sequencing read; and determining the size of the overlap or gap between the paired breakpoints as the breakpoint relationship.
6. The method for detecting a fusion gene according to claim 1, wherein the detailed steps of determining the fusion kmer sequence of the breakpoint in step 3) comprise: determining the 1 or 2 fusion kmer sequences with significantly high frequency as the final fusion kmer sequences, according to the fusion kmer frequency statistics obtained in step 2); and the detailed steps of determining the reference kmer sequence of the breakpoint in step 3) comprise: first obtaining the sequencing reads that cover the breakpoint and whose CIGAR fields contain no soft clip, obtaining the reference kmer sequence of each such sequencing read at the breakpoint, performing frequency statistics, and determining the 1 or 2 reference kmer sequences with significantly high frequency as the final reference kmer sequences.
7. The method for detecting a fusion gene according to claim 1, wherein the detailed steps in step 3) of determining whether each sequencing read covering the breakpoint is a fusion read or a reference read and counting them comprise: first, initializing the fusion read count and the reference read count of the breakpoint; then traversing all sequencing reads at the breakpoint and, for each traversed sequencing read, calculating the similarity of the read's sequence at the breakpoint to the fusion kmer sequence and to the reference kmer sequence respectively; if the similarity to the fusion kmer sequence is higher, judging the sequencing read to be a fusion read and incrementing the fusion read count by 1; if the similarity to the reference kmer sequence is higher, judging the sequencing read to be a reference read and incrementing the reference read count by 1; and finally, after the traversal, obtaining the fusion read count and the reference read count of the breakpoint.
8. The method for detecting a fusion gene according to claim 1, wherein step 4) determines that the fusion type of each paired breakpoint is one of four types, namely deletion, duplication, inversion, and translocation, and outputs the fusion gene detection result for each paired breakpoint, including the positions of the paired breakpoints, the genes in which the paired breakpoints are located, the fusion type, the fusion depth, the total depth, the breakpoint's own kmer sequence, the fusion kmer sequence of the breakpoint, and the distance between the paired breakpoints, wherein the fusion depth is the fusion read count of the breakpoint, the total depth is the sum of the fusion read count and the reference read count of the breakpoint, and the distance between the paired breakpoints is the size of the overlap or gap between them.
9. A fusion gene detection system comprising a computer device, wherein the computer device is programmed or configured to execute the steps of the method for detecting a fusion gene according to any one of claims 1 to 8, or a storage medium of the computer device stores a computer program programmed or configured to execute the method for detecting a fusion gene according to any one of claims 1 to 8.
10. A computer-readable storage medium having stored thereon a computer program programmed or configured to perform the method for detecting a fusion gene according to any one of claims 1 to 8.
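By way of illustration only, and not as the claimed implementation, the preliminary filtering of claim 2 and the read counting of claims 7 and 8 might be sketched in Python as follows. The function names, the mapping-quality threshold of 20, the position-by-position similarity measure, and the choice to leave ties uncounted are all assumptions made for this sketch; the claims do not specify them. The flag bits 0x400 (duplicate) and 0x800 (supplementary alignment) are the standard SAM flag values.

```python
DUPLICATE = 0x400      # SAM flag bit: PCR or optical duplicate
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


def passes_preliminary_filter(flag, mapq, cigar, min_mapq=20):
    """Claim 2 sketch: keep only soft-clipped, well-mapped records that are
    neither duplicates nor supplementary alignments.
    min_mapq=20 is an assumed threshold, not a value from the claims."""
    if "S" not in cigar:      # no soft clip in the CIGAR string
        return False
    if mapq < min_mapq:       # mapping quality below the set threshold
        return False
    return not (flag & (DUPLICATE | SUPPLEMENTARY))


def similarity(seq, kmer):
    """Fraction of kmer positions at which seq agrees with kmer.
    This measure is an assumption; the claims only say 'similarity'."""
    if not kmer:
        return 0.0
    matches = sum(a == b for a, b in zip(seq, kmer))
    return matches / len(kmer)


def count_reads(read_seqs, fusion_kmer, reference_kmer):
    """Claim 7 sketch: classify each read sequence at the breakpoint as a
    fusion read or a reference read and count both classes."""
    counts = {"fusion": 0, "reference": 0}   # initialise both counts
    for seq in read_seqs:
        to_fusion = similarity(seq, fusion_kmer)
        to_reference = similarity(seq, reference_kmer)
        if to_fusion > to_reference:
            counts["fusion"] += 1
        elif to_reference > to_fusion:
            counts["reference"] += 1
        # Ties are left uncounted (assumption; the claims are silent here).
    return counts


def depths(counts):
    """Claim 8: fusion depth is the fusion read count; total depth is the
    sum of the fusion read count and the reference read count."""
    fusion_depth = counts["fusion"]
    total_depth = counts["fusion"] + counts["reference"]
    return fusion_depth, total_depth


if __name__ == "__main__":
    reads = ["ACGTTTTT", "ACGTAAAA", "ACGTTTTA"]
    counts = count_reads(reads, fusion_kmer="ACGTTTTT", reference_kmer="ACGTAAAA")
    print(counts, depths(counts))
```

In this sketch each element of `read_seqs` is assumed to be the portion of a read already extracted at the breakpoint and aligned with the start of the kmer; how that window is extracted is left to the surrounding pipeline.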
CN201911243763.8A 2019-12-06 2019-12-06 Fusion gene detection method, system and medium Active CN111081318B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911243763.8A CN111081318B (en) 2019-12-06 2019-12-06 Fusion gene detection method, system and medium

Publications (2)

Publication Number Publication Date
CN111081318A (en) 2020-04-28
CN111081318B (en) 2023-06-06

Family

ID=70313106

Country Status (1)

Country Link
CN (1) CN111081318B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012100626A (en) * 2010-11-12 2012-05-31 Srl Inc Method of detecting translocation gene
US20120178635A1 (en) * 2009-08-06 2012-07-12 University Of Virginia Patent Foundation Compositions and methods for identifying and detecting sites of translocation and dna fusion junctions
US20150094212A1 (en) * 2013-10-01 2015-04-02 Life Technologies Corporation Systems and Methods for Detecting Structural Variants
KR20170064258A (en) * 2015-12-01 2017-06-09 삼성에스디에스 주식회사 Method and apparatus for detecting breakpoint position of fusion gene
CN107408162A (en) * 2015-06-24 2017-11-28 社会福祉法人三星生命公益财团 For analyzing the method and device of gene
CN107437002A (en) * 2017-04-28 2017-12-05 首度生物科技(苏州)有限公司 A kind of method of quick detection fusion
WO2018215497A1 (en) * 2017-05-25 2018-11-29 Koninklijke Philips N.V. System and method for detecting gene fusion
CN109628599A (en) * 2019-01-08 2019-04-16 大连医科大学附属第二医院 ALK fusion gene detection and parting kit based on the analysis of sandwich method high-resolution fusion curve
CN109712672A (en) * 2018-12-29 2019-05-03 北京优迅医学检验实验室有限公司 Detect method, apparatus, storage medium and the processor of gene rearrangement
CN109994155A (en) * 2019-03-29 2019-07-09 北京市商汤科技开发有限公司 A kind of genetic mutation recognition methods, device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KRISTIAN UNGER et al.: "Novel gene rearrangements in transformed breast cells identified by high-resolution breakpoint analysis of chromosomal aberrations" *
靳卫东 et al.: "bcr-abl gene fusion and its detection" (bcr-abl基因融合及其检测) *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112397142B (en) * 2020-10-13 2023-02-03 山东大学 Gene variation detection method and system for multi-core processor
CN112397142A (en) * 2020-10-13 2021-02-23 山东大学 Gene variation detection method and system for multi-core processor
CN112164423B (en) * 2020-10-14 2021-03-23 深圳吉因加医学检验实验室 Fusion gene detection method, device and storage medium based on RNAseq data
CN112164423A (en) * 2020-10-14 2021-01-01 深圳吉因加医学检验实验室 Fusion gene detection method, device and storage medium based on RNAseq data
CN113035273A (en) * 2021-03-11 2021-06-25 南京先声医学检验有限公司 Rapid and ultrahigh-sensitivity DNA fusion gene detection method
CN112687341A (en) * 2021-03-12 2021-04-20 上海思路迪医学检验所有限公司 Method for identifying chromosome structure variation by taking breakpoint as center
CN112687341B (en) * 2021-03-12 2021-06-04 上海思路迪医学检验所有限公司 Method for identifying chromosome structure variation by taking breakpoint as center
CN114300051A (en) * 2021-12-22 2022-04-08 北京吉因加医学检验实验室有限公司 Method and device for calculating fusion gene frequency
CN114334006B (en) * 2021-12-29 2022-11-29 纳昂达(南京)生物科技有限公司 Method and device for introducing noise in enzyme digestion library building mode
CN114334006A (en) * 2021-12-29 2022-04-12 纳昂达(南京)生物科技有限公司 Method and device for introducing noise in enzyme digestion library building mode
CN115331733A (en) * 2022-10-14 2022-11-11 青岛百创智能制造技术有限公司 Method and device for analyzing sequencing data of space transcriptome chip
CN117831620A (en) * 2023-12-30 2024-04-05 北京诺禾致源科技股份有限公司 Gene fusion site detection method and electronic device
CN117935933A (en) * 2024-03-21 2024-04-26 北京乐土医学检验实验室有限公司 Analysis method and system for CDKN2A/B homozygosity deletion
CN117935933B (en) * 2024-03-21 2024-05-31 北京乐土医学检验实验室有限公司 Analysis method and system for CDKN2A/B homozygosity deletion

Also Published As

Publication number Publication date
CN111081318B (en) 2023-06-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant