CN111292809B

CN111292809B - Method, electronic device, and computer storage medium for detecting RNA level gene fusion

Info

Publication number: CN111292809B
Application number: CN202010066733.0A
Authority: CN
Inventors: 王凯; 陈惠�
Original assignee: Shanghai Zhiben Medical Laboratory Co ltd; Origimed Technology Shanghai Co ltd
Current assignee: ORIGIMED TECHNOLOGY (SHANGHAI) Co.,Ltd.; Shanghai Zhiben medical laboratory Co.,Ltd.
Priority date: 2020-01-20
Filing date: 2020-01-20
Publication date: 2021-03-16
Anticipated expiration: 2040-01-20
Also published as: CN111292809A

Abstract

The present disclosure relates to a method, an electronic device, and a computer storage medium for detecting RNA-level gene fusion. The method comprises: receiving whole-genome alignment information and whole-transcriptome alignment information; clustering discordant paired reads to generate a plurality of large clusters; performing genome-level gene annotation on each large cluster that meets a first predetermined condition to generate a first gene name combination that identifies the large cluster by its corresponding genes; performing transcriptome-level gene annotation on the plurality of paired reads based on the whole-transcriptome alignment information to generate second gene name combinations that identify the paired reads by their corresponding genes; and identifying the same corresponding genes associated with both the first gene name combination and the second gene name combination, so as to determine those genes as potential fusion genes. The method facilitates rapid and correct identification of false positives and can significantly reduce false positives in gene fusion detection results.

Description

Method, electronic device, and computer storage medium for detecting RNA level gene fusion

Technical Field

The present disclosure relates generally to bioinformation detection processing, and in particular, to methods, electronic devices, and computer storage media for detecting gene fusion.

Background

Gene fusion is a process in which partial or complete sequences of two or more genes combine to form a new hybrid gene (a fusion gene), and it is one of the important mechanisms leading to the development and progression of cancer. For example, the targeted drug LOXO-101, which acts on NTRK fusion targets, is effective across a broad range of cancer types, and data indicate that it can achieve an overall cancer control rate of 70-80%. Gene fusion detection is therefore of great significance for tumor prediction and for the design of targeted drugs in clinical tumor treatment. Conventional fusion detection schemes based on RNA-level next-generation sequencing screen for fusion genes of interest using whole transcriptome sequencing or predefined panels. Examples include STAR-Fusion, FusionSeq, TopHat-Fusion, deFuse, FusionHunter, FusionMap, and SOAPfuse.

In conventional fusion detection schemes, the result file generated for a single sample contains many candidate fusions whose authenticity cannot be judged directly. A large number of false positives therefore exist, and the accuracy requirements of clinical testing cannot be met directly.

Disclosure of Invention

The present disclosure provides a method, an electronic device, and a computer storage medium for detecting RNA level gene fusion, which facilitate rapid and correct identification of false positives, and can significantly reduce false positives of gene fusion detection results.

According to a first aspect of the present disclosure, a method of detecting gene fusion is provided. The method comprises: receiving whole-genome alignment information and whole-transcriptome alignment information, wherein the whole-genome alignment information and the whole-transcriptome alignment information are generated based on the results of aligning paired-end (double-end) sequencing data to a whole-genome reference sequence and to a whole-transcriptome reference sequence, respectively, and the paired-end sequencing data comprise a plurality of paired reads of a sample to be tested; clustering discordant paired reads to generate a plurality of large clusters, the discordant paired reads being obtained based on the whole-genome alignment information; performing genome-level gene annotation on each large cluster that meets a first predetermined condition to generate a first gene name combination that identifies the large cluster by its corresponding genes; performing transcriptome-level gene annotation on the plurality of paired reads based on the whole-transcriptome alignment information to generate a second gene name combination that identifies the paired reads by their corresponding genes; and identifying the same corresponding genes associated with both the first and second gene name combinations, so as to determine those genes as potential fusion genes.

According to a second aspect of the present disclosure, there is also provided a computing device for detecting gene fusion, comprising: a memory configured to store one or more computer programs; and a processor coupled to the memory and configured to execute the one or more programs to cause the computing device to perform the method of the first aspect of the disclosure.

According to a third aspect of the present disclosure, there is also provided a non-transitory computer-readable storage medium. The non-transitory computer readable storage medium has stored thereon machine executable instructions which, when executed, cause a machine to perform the method of the first aspect of the disclosure.

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the disclosure, nor is it intended to be used to limit the scope of the disclosure.

Drawings

FIG. 1 shows a schematic diagram of a system 100 for a method of detecting RNA-level gene fusion according to an embodiment of the present disclosure;

FIG. 2 shows a flow diagram of a method 200 for detecting RNA-level gene fusion according to an embodiment of the present disclosure;

FIG. 3 shows a flow diagram of a method 300 for determining the reliability of gene fusions according to an embodiment of the present disclosure;

FIG. 4 shows a flow diagram of a method 400 for retaining reliable potential fusion genes according to an embodiment of the present disclosure;

FIG. 5 shows a flow diagram of a method 500 for retaining reliable potential fusion genes according to an embodiment of the present disclosure;

FIG. 6 shows a schematic diagram of the features of breakpoints on a fusion support cluster according to an embodiment of the present disclosure;

FIG. 7 shows a flow diagram of a method 700 for retaining reliable potential fusion genes according to an embodiment of the present disclosure;

FIG. 8 shows a schematic diagram of the types of paired reads in a fusion support cluster according to an embodiment of the present disclosure;

FIG. 9 shows a flow diagram of a method 900 for displaying potential gene fusions according to an embodiment of the present disclosure;

FIG. 10 shows a schematic diagram of determining the numbers of the exons participating in a fusion based on acquired IGV images according to an embodiment of the present disclosure;

FIG. 11 shows a detection result diagram of a validation example according to an embodiment of the present disclosure;

FIG. 12 shows a flow diagram of a method 1200 for detecting reliable RNA-level gene fusion according to an embodiment of the present disclosure; and

FIG. 13 schematically shows a block diagram of an electronic device 1300 suitable for implementing embodiments of the present disclosure.

Like or corresponding reference characters designate like or corresponding parts throughout the several views.

Detailed Description

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

Regarding the source of the alignment information: in the following embodiments, before the method is performed, paired-end (double-end) sequencing is performed on each sequencing fragment obtained from the sample to be tested by probe capture to obtain paired-end data, where the paired-end data for each fragment include a pair of paired reads. The obtained paired-end data are then aligned to a reference genome to obtain the alignment information.

The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object.

In addition, the term "sequencing fragment" as used herein generally refers to a fragment of an RNA library that is constructed by subjecting a test sample from a target individual to the library construction procedure suited to the sequencing platform; the RNA library consists of random RNA fragments of a certain length.

The term "read" (also rendered "read length") as used herein refers to the sequence obtained by sequencing an end of a sequencing fragment. The term "paired reads" means the two sequences obtained from the two ends of the same sequencing fragment by paired-end sequencing; paired reads can be divided into different types according to how they align to the whole-genome reference sequence. The term "discordant paired reads" (also rendered "misadjusted", "offset", or "deregulated" paired reads) means paired reads for which the alignment result shows that the pair does not align to the whole-genome reference sequence at the normal mapping distance or direction, that one or both of the reads do not align completely at a single position on the whole-genome reference sequence, or that the two reads do not align to the same chromosome; such pairs are usually referred to as PE discordant reads and clipped read pairs. The term "read start position" means the position on the reference genome at which the mapping of the read begins. The term "read end position" means the position on the reference genome at which the mapping of the read ends.

The term "mapping distance (insert size)" as used herein means the distance between the two start positions of a pair of paired reads. The term "mapping direction" means the direction in which a read aligns to the reference sequence. The term "mapping position" means the position of the read start point on the reference sequence.

The term "breakpoint" means the position, on a read, of the base at the boundary between the portion that matches the reference sequence continuously and the portion that does not. The term "same breakpoint" means all breakpoints that map to the same position on the whole-genome reference sequence. The term "gene annotation" means precise annotation of gene structure, down to the specific introns and exons that make up the gene. The term "gene name combination" means the following: the reads of a particular region are annotated, and the annotated gene names are combined into a gene name combination that identifies that region. For example, if gene annotation of a particular region yields the gene names gene A and gene B, the gene name combination identifying that region is A + B. Regions identified by the same gene name combination are annotated to the same corresponding genes; "the same corresponding genes" means that the names of the corresponding genes are the same, not that the specific gene structures are the same.

It has been found that, in the conventional approaches for detecting gene fusion described above, the detection result file generated for a single sample contains many candidate fusions that cannot be directly judged to be true or false. In addition, when RNA-level reads are aligned to the reference genome, RNA does not include intron regions, so short fragments are difficult to align specifically to the genome; this leads to uncertainty as to whether candidate fusions are genuine and thus to a high false positive rate. Conventional schemes also yield a large number of false positives caused by, for example, alternative splicing of adjacent genes, gene families, and pseudogenes, and are therefore limited in their ability to detect gene fusion accurately.

To address, at least in part, one or more of the above problems, as well as other potential problems, example embodiments of the present disclosure propose a scheme for detecting gene fusion. The scheme comprises: receiving whole-genome alignment information and whole-transcriptome alignment information, wherein the whole-genome alignment information and the whole-transcriptome alignment information are generated based on the results of aligning paired-end (double-end) sequencing data to a whole-genome reference sequence and to a whole-transcriptome reference sequence, respectively, and the paired-end sequencing data comprise a plurality of paired reads of a sample to be tested; clustering discordant paired reads to generate a plurality of large clusters, the discordant paired reads being obtained based on the whole-genome alignment information; performing genome-level gene annotation on each large cluster that meets a first predetermined condition to generate a first gene name combination that identifies the large cluster by its corresponding genes; performing transcriptome-level gene annotation on the plurality of paired reads based on the whole-transcriptome alignment information to generate a second gene name combination that identifies the paired reads by their corresponding genes; and identifying the same corresponding genes associated with both the first and second gene name combinations, so as to determine those genes as potential fusion genes.

In the above scheme, whole-genome alignment information and whole-transcriptome alignment information are obtained by aligning the paired-end sequencing data to a whole-genome reference sequence and to a whole-transcriptome reference sequence, respectively. Genome-level gene annotation is performed on the large clusters formed from the discordant paired reads obtained from the whole-genome alignment information, transcriptome-level gene annotation is performed on the paired reads based on the whole-transcriptome alignment information, and the same corresponding genes identified by both annotations are determined as potential fusion genes. Because the RNA sequencing data are aligned to both the genome and the transcriptome, valid candidate gene fusion pairs can be obtained; and by extracting the read pairs and regions that exclusively support a given gene fusion at the transcript level, the scheme avoids the intron-related and non-specific short-fragment alignment problems of conventional methods. This significantly reduces false positives in the detection result and increases the detection rate and accuracy of gene fusion.

Fig. 1 shows a schematic diagram of a system 100 for a method of detecting gene fusion according to an embodiment of the present disclosure. As shown in fig. 1, the system 100 includes an alignment unit 110, an extraction unit 112, a merging unit 114, an intersection unit 116, a filtering unit 118, a photographing unit 120, a fused subtype reading unit 122, and a receiving unit 124.

In some embodiments, the extraction unit 112, the merging unit 114, the intersection unit 116, the filtering unit 118, the fused subtype reading unit 122, and the receiving unit 124 may be configured on one or more computing devices 130. The alignment unit 110 and the photographing unit 120 may be independent of the computing device 130. The computing device 130 may obtain the alignment information generated by the alignment unit 110, which includes, for example, paired-end sequencing data, whole-genome alignment information, and whole-transcriptome alignment information. The computing device 130 may interact with the alignment unit 110 in a wired or wireless manner.

The alignment unit 110 is configured to perform paired-end sequencing on each sequencing fragment obtained from the sample to be tested by probe capture to obtain paired-end data; to generate whole-genome alignment information based on the result of aligning the paired-end data to the whole-genome reference sequence; and to generate whole-transcriptome alignment information based on the result of aligning the paired-end data to the whole-transcriptome reference sequence. The whole-genome alignment information may include the mapping directions, mapping positions, and insert sizes of the paired reads aligned to the whole genome. In some embodiments, the whole-genome alignment information may further include the breakpoints corresponding to the individual reads, found according to how each read matches the reference, and the types of the paired reads. The whole-transcriptome alignment information includes, for example, the transcripts to which the paired reads in the paired-end sequencing data align.

The extraction unit 112 is used to extract the abnormal paired reads, that is, the discordant paired reads, for subsequent analysis. The extraction unit 112 may extract discordant paired reads for which the alignment result indicates that the paired reads do not align to the whole-genome reference sequence at the normal mapping distance and direction, that one or both of the reads do not align completely at a single position on the whole-genome reference sequence, or that the paired reads do not align to the same chromosome.

The merging unit 114 is used for clustering reads, gene annotation, and data merging. In some embodiments, the merging unit 114 may cluster the discordant paired reads directly to generate a plurality of large clusters. In some embodiments, the merging unit 114 may first cluster the discordant paired reads into a plurality of small clusters according to whether the distance between the mapping positions of the discordant paired reads satisfies a predetermined clustering distance, and then merge the small clusters whose discordant paired reads have the same pairing correspondence to generate a large cluster. The merging unit 114 may also perform genome-level gene annotation on each large cluster that meets a first predetermined condition to generate a first gene name combination that identifies the large cluster by its corresponding genes. The merging unit 114 may also perform transcriptome-level gene annotation on the plurality of paired reads based on the whole-transcriptome alignment information to generate a second gene name combination that identifies the paired reads by their corresponding genes.

The intersection unit 116 is used to identify potential fusion genes. For example, the intersection unit 116 may identify the intersection of the first gene name combination, which identifies a large cluster, and the second gene name combination, which identifies a plurality of paired reads, that is, the same corresponding genes with which both gene name combinations are associated, so as to determine those genes as potential fusion genes. The intersection unit 116 may also determine the potential supporting clusters of a potential fusion gene based on the supporting clusters having the same gene name combination.
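The extraction criteria above can be illustrated with a minimal sketch. The record fields, the `max_insert_size` threshold, and the function name are assumptions introduced for illustration and are not taken from the disclosure; a real implementation would read these values from the aligner output, and the orientation check here is deliberately simplified to "both mates on the same strand".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadAlignment:
    chrom: str          # chromosome the read aligned to
    start: int          # mapping position (read start) on the reference
    is_reverse: bool    # mapping direction
    clipped: bool       # True if the read is not aligned completely at one position


def is_discordant(r1: ReadAlignment, r2: ReadAlignment,
                  max_insert_size: int = 1000) -> bool:
    """Return True if the pair matches any of the extraction criteria."""
    if r1.chrom != r2.chrom:
        return True   # the two reads align to different chromosomes
    if r1.clipped or r2.clipped:
        return True   # one or both reads are only partially aligned
    if r1.is_reverse == r2.is_reverse:
        return True   # abnormal mapping direction (simplified check)
    if abs(r2.start - r1.start) > max_insert_size:
        return True   # abnormal mapping distance (assumed threshold)
    return False
```

The threshold of 1000 is only a placeholder; for RNA data aligned to a genome, a suitable value would depend on the library and on how spliced alignments are handled.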

The filtering unit 118 is used to filter out alignment non-specific combinations (atti) at the RNA level and alternative splicing of adjacent genes at the transcriptome level, so that reliable potential fusion genes are retained and unreliable candidate fusion genes are removed. RNA-level fusion detection detects fusion at the level of cellular transcription and, compared with detection at the genomic (DNA) level, reflects the expression of the fusion protein more directly. For example, the filtering unit 118 may determine whether a potential fusion gene satisfies the following removal condition: its support cluster belongs to a gene family, a pseudogene, a predetermined transcript-alignment non-specific combination (atti, e.g., a common transcript-alignment non-specific combination), an ncRNA, or a non-5'-3' transcript structure; or its support cluster belongs to adjacent genes and, relative to the reference transcriptome, aligns to the two adjacent genes respectively. If the removal condition is satisfied, the corresponding potential fusion gene is removed; otherwise, the potential fusion gene is retained.

The photographing unit 120 is used to save, in the form of snapshots, the read-level supporting evidence for a gene combination.

The fused subtype reading unit 122 is used to predict the fusion subtype formed at the RNA level. The fused subtype reading unit 122 may determine the fusion partner and the fusion subtype based on the previously determined potential fusion genes.
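As a hedged illustration of how the removal condition might be applied, the sketch below treats each criterion as a boolean flag on a candidate record. The `CandidateFusion` class, its field names, and the assumption that each flag has already been computed by an upstream annotation step are all hypothetical; the disclosure does not specify this data structure.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CandidateFusion:
    """Hypothetical record; each flag is assumed to come from upstream annotation."""
    name: str
    gene_family: bool = False        # partners belong to the same gene family
    pseudogene: bool = False         # a partner is a pseudogene
    known_nonspecific: bool = False  # predetermined non-specific alignment combination
    ncrna: bool = False              # a partner is an ncRNA
    five_to_three: bool = True       # fusion has a 5'-3' transcript structure
    adjacent_genes: bool = False     # adjacent genes, reads align to each gene separately


def should_remove(c: CandidateFusion) -> bool:
    """True if the candidate satisfies any part of the removal condition."""
    return (c.gene_family or c.pseudogene or c.known_nonspecific
            or c.ncrna or not c.five_to_three or c.adjacent_genes)


def retain_reliable(candidates: List[CandidateFusion]) -> List[CandidateFusion]:
    """Keep only candidates that do not satisfy the removal condition."""
    return [c for c in candidates if not should_remove(c)]
```

Expressing each criterion as an independent flag keeps the filter easy to audit: a removed candidate can be traced to the specific flag that caused its removal.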

The receiving unit 124 is configured to receive breakpoint information and gene structure annotation information of successfully photographed gene combinations, and predict information of the fusion transcript.

A method for detecting RNA-level gene fusion according to an embodiment of the present disclosure will be described below with reference to FIG. 2. FIG. 2 shows a flow diagram of a method 200 for detecting gene fusion according to an embodiment of the present disclosure. It should be understood that the method 200 may be performed, for example, at the electronic device 1300 depicted in FIG. 13, and may also be performed at the computing device 130 depicted in FIG. 1. It should be understood that method 200 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At block 202, the computing device 130 may receive whole-genome alignment information and whole-transcriptome alignment information, which are generated based on the results of aligning paired-end sequencing data to a whole-genome reference sequence and to a whole-transcriptome reference sequence, respectively; the paired-end sequencing data include a plurality of paired reads of a sample to be tested. In some embodiments, the computing device 130 receives the whole-genome alignment information and the whole-transcriptome alignment information from the alignment unit 110 via a configured communication unit (not shown in FIG. 1).

With respect to the whole-genome alignment information, in some embodiments it comprises the mapping direction and mapping position on the whole genome obtained by aligning each read. In some embodiments, it further comprises at least one of: the insert size obtained by aligning the paired reads, the breakpoints corresponding to the individual reads, found according to how each read matches the reference, and the types of the paired reads. With respect to the whole-transcriptome alignment information, in some embodiments it comprises the transcripts to which the paired reads in the paired-end sequencing data align.

At block 204, the computing device 130 may cluster discordant paired reads to generate a plurality of large clusters, the discordant paired reads being obtained based on the whole-genome alignment information. For example, the computing device 130 can screen out the discordant paired reads (PE discordant reads) based on the alignment information obtained at block 202 from aligning the RNA paired-end sequencing data to the whole-genome reference sequence, and then cluster those reads to generate a plurality of large clusters.

There are various ways of generating the plurality of large clusters. In some embodiments, the computing device 130 may cluster the discordant paired reads directly into a plurality of large clusters based on a predetermined clustering rule (e.g., the spacing between the mapping positions of the discordant paired reads). In some embodiments, the computing device 130 may instead first cluster the discordant paired reads into a plurality of small clusters according to whether the spacing between the mapping positions of the discordant paired reads satisfies a predetermined clustering distance, and then merge the small clusters whose discordant paired reads have the same pairing correspondence to generate a large cluster. The approach of generating small clusters first and then merging them into large clusters is described below.

First, if the computing device 130 determines that the spacing between the mapping positions of the discordant paired reads satisfies the predetermined clustering distance, it clusters those reads to generate a plurality of small clusters. For example, the computing device 130 may determine the distance between the end position of the current read and the start position of the next adjacent read, and then determine whether that distance satisfies the predetermined clustering distance (for example, less than or equal to a threshold in the range of 500 to 1000 bp). Specifically, the distance between the end position of the first read and the start position of the adjacent read is compared with the predetermined clustering distance, and so on, until the distance between the end position of a read and the start position of the next adjacent read exceeds the predetermined clustering distance; that read is then the last read in the small cluster. Repeating this procedure generates a plurality of small clusters.

Then, the computing device 130 determines whether the pairing correspondence of the discordant paired reads included in the small clusters is the same and, if the spacing between those small clusters satisfies a predetermined distance, merges the small clusters to generate a large cluster. For example, the computing device 130 may merge all of the small clusters obtained above whose spacing satisfies the predetermined distance (e.g., less than 500 bp) to obtain each large cluster. In this way, small clusters are first formed based on the spacing between the mapping positions of the discordant paired reads and the predetermined clustering distance, and small clusters whose spacing satisfies the predetermined distance are then merged into large clusters, which helps to improve the accuracy with which potential gene fusions are determined.

At block 206, the computing device 130 may perform genome-level gene annotation on each large cluster that meets a first predetermined condition to generate a first gene name combination that identifies the large cluster by its corresponding genes. The first predetermined condition includes, for example, that the number of discordant read pairs in the large cluster is greater than or equal to a first predetermined number of pairs.
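The small-cluster step described above can be sketched as a single pass over position-sorted reads. This is a minimal sketch under stated assumptions: the `(chromosome, start, end)` tuple layout, the `max_gap` parameter name, and its default of 500 bp are chosen for illustration only.

```python
from typing import List, Tuple

# (chromosome, start, end) of one read taken from a discordant pair
Read = Tuple[str, int, int]


def build_small_clusters(reads: List[Read], max_gap: int = 500) -> List[List[Read]]:
    """Group reads so that, within a cluster, each read starts no more than
    max_gap bases after the end of the previous read on the same chromosome."""
    clusters: List[List[Read]] = []
    for read in sorted(reads):            # sort by chromosome, then start, then end
        chrom, start, _ = read
        if clusters:
            last_chrom, _, last_end = clusters[-1][-1]
            if chrom == last_chrom and start - last_end <= max_gap:
                clusters[-1].append(read)  # gap is within the clustering distance
                continue
        clusters.append([read])            # gap exceeded: start a new small cluster
    return clusters
```

For example, with `max_gap=500`, reads at chr1:100-200 and chr1:250-350 fall into one small cluster, while a read at chr1:2000-2100 starts a new one because it begins 1650 bases after the previous read ends.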

For example, the computing device 130 performs genome-wide annotation of reads in a large cluster by starting location (start location) to identify the gene name combination in which each read resides to generate a first gene name combination identifying the large cluster. In some embodiments, the computing device 130 may further record the read of each marked gene name combination and the corresponding gene name combination thereof according to the PE relationship with the same name id (the prefix of the name id of R1 and R2 in a pair of PE reads is the same).
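Pairing R1 and R2 by their shared name id prefix can be sketched as follows. The `/1` and `/2` mate suffixes are an assumed naming convention used only for illustration; the disclosure states only that the prefixes of the two name ids are the same, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def pair_key(read_name: str) -> str:
    """Strip a trailing '/1' or '/2' mate suffix (assumed naming convention)."""
    if read_name.endswith(("/1", "/2")):
        return read_name[:-2]
    return read_name


def group_mates(read_names: Iterable[str]) -> Dict[str, List[str]]:
    """Group read names that share the same name id prefix."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in read_names:
        groups[pair_key(name)].append(name)
    return dict(groups)
```

Keying every record on the shared prefix is what later allows the genome-level and transcriptome-level annotations of the same read pair to be matched against each other.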

In some embodiments, the computing device 130 performs genome-wide annotation on the regions corresponding to the large clusters meeting the first predetermined condition (e.g., the logarithm of the pair-wise misadjusted read length is equal to or greater than the first predetermined logarithm) to obtain first gene name combinations corresponding to the large clusters, and combines the large clusters with the same first gene name combination to obtain the supporting clusters corresponding to the first gene name combination. For example, the computing device 130 first determines whether the logarithm of the pair-wise misadjusted read lengths included in the large cluster generated at block 204 satisfies greater than or equal to a first predetermined logarithm (e.g., without limitation, 5), for which the first predetermined logarithm is satisfied, the computing device 130 proceeds with genome-wide level gene annotation to arrive at the corresponding first gene name combination. For example, if the gene is annotated with the names of gene A and gene B, the gene combination name is A + B.
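The threshold check and genome-level annotation can be illustrated with the sketch below. The `GENES` table, its coordinates, the `min_pairs` parameter name, and the choice to sort the two gene names alphabetically are all assumptions for illustration; a real implementation would use a full gene annotation file and might preserve 5'-3' order instead.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical gene table: gene name -> (chromosome, start, end)
GENES: Dict[str, Tuple[str, int, int]] = {
    "A": ("chr1", 1_000, 5_000),
    "B": ("chr2", 10_000, 20_000),
}

# One discordant pair: ((chrom of R1, start of R1), (chrom of R2, start of R2))
Pair = Tuple[Tuple[str, int], Tuple[str, int]]


def gene_at(chrom: str, pos: int) -> Optional[str]:
    """Return the name of the gene covering a start position, if any."""
    for name, (g_chrom, g_start, g_end) in GENES.items():
        if chrom == g_chrom and g_start <= pos <= g_end:
            return name
    return None


def annotate_large_cluster(pairs: List[Pair], min_pairs: int = 5) -> Optional[str]:
    """Return a gene name combination such as 'A + B' for a large cluster,
    or None if the cluster is too small or not annotated unambiguously."""
    if len(pairs) < min_pairs:
        return None                      # fails the first predetermined condition
    combos = set()
    for (c1, p1), (c2, p2) in pairs:
        g1, g2 = gene_at(c1, p1), gene_at(c2, p2)
        if g1 and g2 and g1 != g2:
            combos.add(" + ".join(sorted((g1, g2))))
    return combos.pop() if len(combos) == 1 else None
```

A cluster of five pairs whose mates start inside gene A and gene B would be annotated as `A + B`, whereas the same cluster with only four pairs would be skipped.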

At block 208, the computing device 130 performs a full transcriptome-level gene annotation for the plurality of paired reads based on the full transcriptome alignment information to generate a second gene name combination for identifying the plurality of paired reads based on the corresponding genes. For example, the computing device 130 may compare the transcripts aligned by PE paired reads id and perform whole transcriptome-level gene annotation on the corresponding genes based on the specific comparison information obtained at block 202 by comparing the RNA transcriptome double-ended sequencing data with the whole transcriptome reference sequence, to obtain the PE read id and the name combination of the corresponding gene.
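A minimal sketch of the transcriptome-level annotation follows. The `TRANSCRIPT_TO_GENE` table, the transcript identifiers, and the input layout (pair id mapped to the transcripts hit by R1 and R2) are hypothetical and stand in for whatever the aligner and annotation source actually provide.

```python
from typing import Dict, Tuple

# Hypothetical transcript-to-gene lookup
TRANSCRIPT_TO_GENE: Dict[str, str] = {
    "tx_A1": "A",
    "tx_A2": "A",
    "tx_B1": "B",
}


def annotate_pairs(alignments: Dict[str, Tuple[str, str]]) -> Dict[str, str]:
    """Map each PE read id to a gene name combination when its two reads
    align to transcripts of two different genes.

    alignments: PE read id -> (transcript hit by R1, transcript hit by R2)
    """
    combos: Dict[str, str] = {}
    for pair_id, (tx1, tx2) in alignments.items():
        g1 = TRANSCRIPT_TO_GENE.get(tx1)
        g2 = TRANSCRIPT_TO_GENE.get(tx2)
        if g1 and g2 and g1 != g2:
            combos[pair_id] = " + ".join(sorted((g1, g2)))
    return combos
```

Pairs whose two reads hit different transcripts of the same gene (for example `tx_A1` and `tx_A2`) are not given a combination, since they reflect alternative transcripts rather than two distinct genes.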

At block 210, the computing device 130 may identify the same corresponding genes with which the first and second gene name combinations are associated to determine potential fusion genes based on the same corresponding genes. For example, the computing device 130 may perform a logical intersection process on a first gene name combination generated at block 206 based on the genome-wide level gene annotations and a second gene name combination generated at block 208 based on the genome-wide level gene annotations, resulting in a gene name combination where reads ids on the same corresponding gene are aligned by both the genome-wide and the transcriptome and correspond to each other. Then, carrying out quantity statistics on reads of different corresponding genes to obtain PE discordant reads support logarithms and reads ids files spanning different combined genes.
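The intersection and counting step can be sketched with two dictionaries keyed by PE read id. The dictionary layout and the function name are assumptions for illustration; the sketch keeps only read ids whose genome-level and transcriptome-level gene name combinations agree and counts the supporting pairs per combination.

```python
from collections import Counter
from typing import Dict


def intersect_annotations(genome_level: Dict[str, str],
                          transcriptome_level: Dict[str, str]) -> Counter:
    """Count read pairs whose gene name combination is the same at both levels.

    Both arguments map a PE read id to a gene name combination such as 'A + B'.
    """
    support: Counter = Counter()
    for read_id, combo in genome_level.items():
        if transcriptome_level.get(read_id) == combo:
            support[combo] += 1   # this pair supports the combination at both levels
    return support
```

A read pair annotated as `A + B` on the genome but as `A + E` on the transcriptome contributes nothing, which is how disagreement between the two alignments removes a candidate's support.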

In this scheme, the RNA gene fusion detection method takes gene combinations that are annotated with the same gene name combination at both the genome level and the transcriptome level as the corresponding potential fusion genes, determines the support clusters with the same gene name as the potential support clusters of those potential fusion genes, and then finds the reliable fusion genes through subsequent reliability determination. This avoids the high false positive rate that arises when RNA fragment reads are aligned against the whole genome alone, where short fragments are misaligned because the RNA contains no intron regions. It also avoids the missed detections that arise when reads are aligned directly against transcripts alone, where alternative splicing among multiple transcripts or partial cross-linking of RNA fragments causes non-specific alignment. In other words, the method can both reduce the probability of false positives and improve the detection rate of actual fusion genes.

In some embodiments, the method 200 further comprises: the computing device 130 merges the large clusters identified by the same gene combination name to generate supporting clusters associated with that gene combination name; and potential supporting clusters of potential fusion genes are determined based on the supporting clusters having the same gene name. For example, after annotating each large cluster that satisfies the first predetermined logarithm, the computing device 130 merges the large clusters having the same first gene name combination, so as to obtain the supporting cluster corresponding to that gene combination name. For example, if the first gene name combination after annotation of two large clusters is A + B, where one large cluster includes 5 pairs of pairwise misadjusted read lengths and the other includes 6 pairs, the two large clusters are merged to obtain 5 + 6 = 11 pairs of pairwise misadjusted read lengths. The computing device 130 may determine these 11 pairs as the supporting cluster of the gene combination corresponding to the gene combination name A + B.
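The merging of large clusters into supporting clusters can be illustrated with a short sketch. The function name `merge_large_clusters` and the tuple layout are assumptions made for illustration; the threshold of 5 is the example value given for the first predetermined logarithm.

```python
from collections import Counter


def merge_large_clusters(large_clusters, min_pairs=5):
    """Sum the read-pair counts of large clusters sharing a gene combination.

    large_clusters: iterable of (gene_combination_name, pair_count).
    Clusters below min_pairs are not annotated and are therefore skipped.
    """
    support = Counter()
    for combo, pair_count in large_clusters:
        if pair_count >= min_pairs:
            support[combo] += pair_count
    return dict(support)


# Two large clusters annotated A+B (5 and 6 pairs) and one small cluster.
clusters = [("A+B", 5), ("A+B", 6), ("C+D", 3)]
support_clusters = merge_large_clusters(clusters)
```

With the example from the text, the two A+B clusters merge into one supporting cluster of 11 pairs, while the 3-pair cluster is dropped before annotation.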

In some embodiments, method 200 further comprises a method of reliability determination of the potential fusion gene. The following will explain the relevant manner of the method for determining the reliability of the fusion gene with reference to fig. 3 to 8, and will not be described herein again.

In some embodiments, method 200 also includes a method of visualization of the potential fusion gene. The visualization method is explained below with reference to fig. 9 to 10 and is not described further here.

Fig. 3 shows a flow diagram of a method 300 for determining the reliability of gene fusion according to an embodiment of the present disclosure. It should be understood that the method 300 may be performed, for example, at the electronic device 1300 depicted in fig. 13, and may also be performed at the computing device 130 depicted in fig. 1. It should be understood that method 300 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At block 302, the computing device 130 may determine whether the number of pairwise misadjusted read lengths included by the potential supporting cluster is greater than or equal to a second predetermined logarithm.

At block 304, if the computing device 130 determines that the number of pairwise misadjusted read lengths included in the potential support cluster is greater than or equal to the second predetermined logarithm, it determines whether the potential fusion gene satisfies any of the following removal conditions: belonging to at least one of a gene family, a pseudogene, a non-specific combination of predetermined transcript alignments, or non-coding ribonucleic acid (ncRNA); belonging to adjacent genes for which the support cluster read length directions obtained by aligning the two adjacent genes with the reference transcriptome are opposite; or not belonging to a predetermined transcription structure (the predetermined transcription structure is, for example, a 5'-3' transcription structure). For example, for the potential gene combinations determined at block 210 that are supported by more than 5 pairs of reads, the computing device 130 may identify and remove gene families, pseudogenes, adjacent genes whose corresponding support clusters have opposite read directions, non-specific combinations of common transcript alignments, ncRNAs, and non-5'-3' transcription structures, and then perform subsequent processing on the remaining potential gene combinations.

In some embodiments, a potential fusion gene is removed if its two genes are adjacent genes and the support cluster read length directions obtained from aligning the two genes with the reference transcriptome are opposite. For example, genes A and B of a potential gene combination are determined to be adjacent genes, and the support cluster read direction obtained from the reference transcriptome alignment of gene A is opposite to that obtained for gene B. This opposite-direction situation arises, for example, when the gene combination A-B forms an alternatively spliced 5'-3' transcript. In this way, potential gene combinations whose genes are adjacent but whose corresponding support cluster read length directions are not opposite are retained, which avoids missed detections.

At block 306, if the computing device 130 determines that the potential fused gene does not satisfy the removal condition, the potential fused gene is left. Conversely, if the computing device 130 determines that the potential fused gene satisfies the removal condition, then the fused gene is removed at block 310.

At block 308, the computing device 130 determines that the potential fusion genes left are reliable potential fusion genes.

By employing the above means, the present disclosure can improve the reliability of the identified potential fusion gene.
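The decision flow of method 300 can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Candidate` record and its boolean flags are hypothetical stand-ins for annotations that would come from external gene databases, and the threshold of 5 for the second predetermined logarithm is assumed from the example in the text.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one potential fusion gene."""
    name: str
    support_pairs: int
    in_gene_family: bool = False
    is_pseudogene: bool = False
    nonspecific_transcript_combo: bool = False
    is_ncrna: bool = False
    adjacent_genes: bool = False
    opposite_read_directions: bool = False
    is_5p_3p_structure: bool = True


def meets_removal_condition(c):
    """True if any removal condition described for block 304 applies."""
    return (
        c.in_gene_family
        or c.is_pseudogene
        or c.nonspecific_transcript_combo
        or c.is_ncrna
        # Adjacent genes are removed only when read directions are opposite.
        or (c.adjacent_genes and c.opposite_read_directions)
        or not c.is_5p_3p_structure
    )


def reliable_candidates(candidates, second_min_pairs=5):
    """Blocks 302-308: enough support and no removal condition met."""
    return [
        c for c in candidates
        if c.support_pairs >= second_min_pairs and not meets_removal_condition(c)
    ]


examples = [
    Candidate("A+B", 8),
    Candidate("C+D", 9, is_pseudogene=True),
    Candidate("E+F", 7, adjacent_genes=True),
    Candidate("G+H", 7, adjacent_genes=True, opposite_read_directions=True),
    Candidate("I+J", 3),
]
kept_names = [c.name for c in reliable_candidates(examples)]
```

Note that E+F is kept although its genes are adjacent, because its support cluster read directions are not opposite.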

In some embodiments, method 300 further includes a method for filtering potential fusion genes to leave reliable potential RNA fusions. The method of leaving a reliable potential fusion gene is described in detail below in conjunction with FIGS. 4-7.

Fig. 4 shows a flow diagram of a method 400 for leaving behind reliable potential fusion genes, according to an embodiment of the present disclosure. It should be understood that the method 400 may be performed, for example, at the electronic device 1300 depicted in fig. 13, and may also be performed at the computing device 130 depicted in fig. 1. It should be understood that method 400 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At block 402, the computing device 130 may determine whether the potential fusion gene satisfies none of the removal conditions.

At block 404, if the computing device 130 determines that the potential fusion gene satisfies none of the removal conditions, the pairwise misadjusted reads corresponding to each corresponding gene are individually aligned against the transcript reference sequence of the predetermined transcript of that gene to determine the alignment quality. Conversely, if the computing device 130 determines that the potential fusion gene satisfies any of the removal conditions, then at block 412 the potential fusion gene is removed.

For example, the alignment quality is further determined if the potential fusion gene does not belong to any of a gene family, a pseudogene, a predetermined transcript alignment non-specific combination, ncRNA, or a non-5'-3' transcription structure, and is not a pair of adjacent genes whose supporting cluster read length directions, obtained by aligning the two genes with the reference transcriptome, are opposite. For example, the first gene name combination of a potential fusion gene is A + B, and the corresponding genes in the first gene name combination are A and B. The pairwise misadjusted reads corresponding to gene A in the potential support cluster are aligned individually against the transcript reference sequence of the single common transcript of that gene. Likewise, the pairwise misadjusted reads corresponding to gene B in the potential support cluster are aligned individually against the transcript reference sequence of the single common transcript of that gene. The alignment quality is generated for each of the two individual alignments separately.

At block 406, the computing device 130 determines whether the alignment quality is greater than or equal to a predetermined alignment quality. In some embodiments, the predetermined alignment quality is, for example, 10. For example, after the alignment quality is determined, it is checked whether the alignment quality is greater than or equal to 10.

At block 408, if the computing device 130 determines that the alignment quality is greater than or equal to the predetermined alignment quality, it determines whether the number of pairwise misadjusted read lengths in the potential support cluster that satisfy the predetermined filtering condition is greater than or equal to a third predetermined logarithm. If it is determined that the alignment quality is less than the predetermined alignment quality, the corresponding removal is performed at block 412.

For example, if the quality of the above individual alignment for the corresponding gene A is greater than or equal to 10, and the quality of the individual alignment for the corresponding gene B is less than 10, all the pairwise misadjusted read lengths corresponding to gene A can be extracted from the potential support cluster and filtered according to the predetermined filtering condition, and it is then determined whether the number of pairs of pairwise misadjusted read lengths satisfying the predetermined filtering condition is greater than or equal to a third predetermined logarithm.

With respect to the predetermined filtering condition, in some embodiments, it is, for example: pairwise reads are aligned on a whole genome reference sequence with at least one end being a full alignment or both ends being partial alignments. In some embodiments, the third predetermined logarithm is, for example, 5.

At block 410, if the computing device 130 determines that the number of pairs of misadjusted read lengths that satisfy the predetermined filtering condition is greater than or equal to the third predetermined logarithm, the potential fusion gene is left. If the number of pairwise misadjusted read lengths in the potential support cluster that satisfy the predetermined filtering condition is less than the third predetermined logarithm, the corresponding removal is performed at block 412. For example, when the number of pairs of misadjusted read lengths in a potential support cluster that satisfy the predetermined filtering condition is greater than or equal to the third predetermined logarithm, the corresponding potential fusion gene is left as a reliable potential fusion gene, and all the pairwise misadjusted reads that satisfy the predetermined filtering condition are used as the fusion support cluster of that potential fusion gene.

By adopting the above means, the present disclosure can further improve the reliability of the identified potential fusion gene.
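Blocks 406 to 410 can be sketched as follows. This is a simplified illustration rather than the disclosed implementation: it assumes that each read pair carries its own mapping quality and a per-end alignment state, and the names `passes_filter`, `fusion_support_cluster`, and the state labels `"full"`, `"partial"`, and `"unaligned"` are hypothetical. The thresholds 10 and 5 are the example values given in the text.

```python
def passes_filter(ends):
    """Predetermined filtering condition on the whole genome alignment.

    ends: tuple with the alignment state of the two reads of a pair, each
    one of "full", "partial" or "unaligned" (labels assumed here).
    A pair passes if at least one end is fully aligned, or both ends are
    partially aligned.
    """
    return "full" in ends or ends == ("partial", "partial")


def fusion_support_cluster(read_pairs, min_mapq=10, third_min_pairs=5):
    """Return the filtered pairs, or None if the candidate is removed."""
    kept = [
        pair for pair in read_pairs
        if pair["mapq"] >= min_mapq and passes_filter(pair["ends"])
    ]
    return kept if len(kept) >= third_min_pairs else None


# Five well-aligned pairs, one low-quality pair, one pair failing the filter.
pairs = [{"id": f"r{i}", "mapq": 30, "ends": ("full", "partial")} for i in range(5)]
pairs.append({"id": "r5", "mapq": 4, "ends": ("full", "full")})
pairs.append({"id": "r6", "mapq": 30, "ends": ("partial", "unaligned")})

cluster = fusion_support_cluster(pairs)
```

Returning the surviving pairs rather than a boolean mirrors the text, where the pairs that satisfy the filtering condition become the fusion support cluster of the retained candidate.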

Fig. 5 shows a flow diagram of a method 500 for leaving behind reliable potential fusion genes, according to an embodiment of the present disclosure. It should be understood that the method 500 may be performed, for example, at the electronic device 1300 depicted in fig. 13, and may also be performed at the computing device 130 depicted in fig. 1. It should be understood that method 500 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At block 502, the computing device 130 may determine the fusion support clusters corresponding to potential fusion genes whose alignment quality is greater than or equal to a predetermined alignment quality. In some embodiments, the predetermined alignment quality is, for example, 10. In some embodiments, the computing device 130 may individually align the reads ids of each gene combination obtained at block 404 against the reference sequence of the single common transcript of each single gene, extract the read pairs that align to the transcript pair of each gene combination with good mapping quality (MAPQ), and filter them to leave read pairs in which at least one of the two reads is wholly matched (whole matched) or in which the reads are clipped. These read pairs form a mini-sam file for each gene combination, and the number of supporting read pairs contained in the mini-sam file is recounted as the number of fusion-supporting read pairs for that gene combination.

At block 504, the computing device 130 merges the coverage regions of the fusion support cluster on the whole transcriptome reference sequence whose spacing falls within a predetermined distance range, to generate merged regions. In some embodiments, the predetermined spacing ranges from 0 to 500 bp. For example, the individual pairwise misadjusted read lengths in the fusion support cluster determined at block 502 are distributed over different positions on the whole transcriptome reference sequence. Starting from the first misadjusted read length and proceeding in order of alignment position, the distance between adjacent misadjusted read lengths is examined. If the distance between one misadjusted read length and the next adjacent one is determined to fall outside the range of 0 to 500 bp, the coverage regions of that read length and the preceding read lengths are combined to generate one merged region, and the next read length starts a new region. These steps are repeated until the coverage regions of all misadjusted read lengths in the fusion support cluster have been merged, yielding a plurality of different merged regions.

At block 506, the computing device 130 determines whether the number of merge regions is less than or equal to a predetermined number. In some embodiments, the predetermined number is, for example and without limitation, 6. For example, a determination is made whether the number of merge regions generated at block 504 is less than or equal to 6.

At block 508, if the computing device 130 determines that the number of merged regions is less than or equal to the predetermined number, the potential fusion gene is left. If the computing device 130 determines that the number of merged regions is greater than the predetermined number, then at block 510 the potential fusion gene is removed. For example, the computing device 130 may perform region merging on the reads distribution (regions) of each gene combination pair obtained at block 502 to form merged regions, and if the number of regions after merging does not exceed 6, the potential fusion gene is left. Further, the computing device 130 may extract the transcript-exon annotation BED file of the corresponding gene pair and load it into IGV for subsequent image capture.

It is understood that the number of regions remaining for a gene combination after region merging reflects the specificity of the alignment. The reliability filtering described above relies mainly on predetermined conditions, so non-specific alignments caused by unanticipated situations may go unrecognized. Such non-specific alignments can be recognized by using the alignment specificity reflected in the number of merged regions. Thus, by employing the above approach, the present disclosure is able to determine the reliability of potential fusion genes more accurately. In addition, the method helps avoid missed detections while maintaining processing efficiency and improving processing speed.
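The region merging at blocks 504 to 508 can be sketched with a standard interval merge. This is a minimal sketch under assumptions: the function names and the coordinate tuples are hypothetical, coverage regions are given as (start, end) positions on the transcript reference, and the values 500 bp and 6 are the example values from the text.

```python
def merge_regions(intervals, max_gap=500):
    """Merge coverage regions whose spacing is at most max_gap bp."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the current region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Gap outside the range: start a new merged region.
            merged.append([start, end])
    return [tuple(region) for region in merged]


def is_specific(intervals, max_regions=6, max_gap=500):
    """Block 506: keep the candidate if it has few merged regions."""
    return len(merge_regions(intervals, max_gap)) <= max_regions


coverage = [(100, 250), (300, 450), (2000, 2150)]
regions = merge_regions(coverage)
```

A candidate whose reads scatter into many distant regions fails `is_specific`, which is how the number of merged regions serves as a proxy for alignment specificity.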

In some embodiments, if it is determined at block 508 that the number of merged regions is less than or equal to the predetermined number, it may be further determined whether the corresponding fusion support cluster satisfies at least one of the following conditions: the breakpoints present in the fusion support cluster meet a predetermined consistency condition; or the alignment of the fusion support cluster on the transcripts is a continuous alignment. If it is determined that the fusion support cluster satisfies at least one of these conditions, the potential fusion gene is left as a reliable potential fusion gene.

With respect to the predetermined consistency condition, in some embodiments, it includes, for example: among all the misadjusted read lengths in the fusion support cluster that share the same breakpoint, the sequences that cannot be continuously aligned, taken from that breakpoint as the starting point, are identical, and the number of misadjusted read lengths sharing that breakpoint is greater than or equal to 2.

With respect to continuous alignment, in some embodiments, it includes, for example: the alignment of each paired read in the fusion support cluster covers a region on the transcript that is contiguous, rather than a discontinuous region with gaps.

The following describes, with reference to fig. 6, a case in which the breakpoints present in the fusion support cluster satisfy the predetermined consistency condition. FIG. 6 shows a schematic diagram of the features of breakpoints on fusion support clusters, according to an embodiment of the present disclosure. As shown in FIG. 6, the misadjusted reads with the same breakpoint are stacked vertically in alignment with the reference genome; the gray part indicated by 610 represents the matching part of each read with the same breakpoint, and the black part indicated by 620 represents the non-matching part. In FIG. 6, the portions of the misadjusted reads that cannot be aligned contain consecutive identical sequences, so the breakpoints of this fusion support cluster satisfy the consistency condition. The potential fusion gene is therefore left and determined to be a reliable potential fusion gene.
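The consistency condition illustrated in FIG. 6 can be sketched as follows. This is one possible reading and not the disclosed implementation: it assumes each misadjusted read is summarized by its breakpoint position and the unaligned (clipped) sequence starting at that breakpoint, and it treats clipped sequences as identical when they agree over the length of the shortest one. The function name and the sample sequences are hypothetical.

```python
from collections import defaultdict


def consistent_breakpoints(reads, min_reads=2):
    """Return breakpoints whose reads share the same unaligned sequence.

    reads: iterable of (breakpoint_position, clipped_sequence).
    """
    by_breakpoint = defaultdict(list)
    for position, clipped in reads:
        by_breakpoint[position].append(clipped)

    consistent = []
    for position, clips in by_breakpoint.items():
        if len(clips) < min_reads:
            continue  # fewer than 2 reads share this breakpoint
        shortest = min(len(seq) for seq in clips)
        # Compare the clipped sequences over their common length.
        if shortest > 0 and len({seq[:shortest] for seq in clips}) == 1:
            consistent.append(position)
    return sorted(consistent)


reads = [
    (120, "ACGTT"), (120, "ACG"), (120, "ACGTTGA"),  # same clipped prefix
    (300, "TTT"), (300, "GGA"),                      # clipped parts differ
    (410, "CCC"),                                    # only one read
]
breakpoints = consistent_breakpoints(reads)
```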

In some embodiments, method 500 further includes method 700 for filtering potential fusion genes so as to leave reliable potential fusion genes. Fig. 7 shows a flow diagram of a method 700 for leaving behind reliable potential fusion genes, according to an embodiment of the disclosure. It should be understood that method 700 may be performed, for example, at electronic device 1300 depicted in fig. 13, and may also be performed at the computing device 130 depicted in fig. 1. It should be understood that method 700 may also include additional acts not shown and/or may omit acts shown, as the scope of the present disclosure is not limited in this respect.

At block 702, the computing device 130 aligns the paired reads in the corresponding fusion support clusters over the genome-wide reference sequence to determine the class of the paired reads based on whether a breakpoint exists at both ends.

In some embodiments, the determined categories of paired read lengths include: no breakpoint at either end (e.g., class I), breakpoints at both ends (e.g., class II), and a breakpoint at only one end (e.g., class III). FIG. 8 illustrates a schematic diagram of the classes of paired read lengths in a fusion support cluster, according to an embodiment of the disclosure. In FIG. 8, 810 indicates the category "no breakpoint at either end", in which neither read length of the pair has a breakpoint, i.e., both ends match completely (Whole Mapping) without crossing the breakpoints shown by dashed lines 840 and 842 in FIG. 8. 820 indicates the category "breakpoints at both ends" (Two BP), i.e., both read lengths of the pair have breakpoints, as indicated by dashed lines 840 and 842 in FIG. 8. 830 indicates the category "breakpoint at only one end" (One Partner BP), i.e., one of the two read lengths of the pair has a breakpoint, indicated by dashed line 840 in FIG. 8, and the other has no breakpoint at dashed line 842 in FIG. 8.
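The three categories of FIG. 8 and the check at block 704 can be expressed compactly. The following is a minimal sketch; the function names are hypothetical and the input is assumed to be a pair of booleans stating whether each read of the pair contains a breakpoint.

```python
def classify_pair(end1_has_breakpoint, end2_has_breakpoint):
    """Map a read pair to class I, II or III by its number of breakpoints."""
    breakpoint_count = int(end1_has_breakpoint) + int(end2_has_breakpoint)
    # 0 breakpoints: Whole Mapping, 2: Two BP, 1: One Partner BP.
    return {0: "I", 2: "II", 1: "III"}[breakpoint_count]


def has_two_classes(pairs):
    """Block 704: does the cluster contain at least two different classes?"""
    return len({classify_pair(a, b) for a, b in pairs}) >= 2


cluster = [(False, False), (True, True), (True, False)]
classes = [classify_pair(a, b) for a, b in cluster]
```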

At block 704, the computing device 130 determines whether at least two different types of paired read lengths are included in the fusion support cluster.

At block 706, if the computing device 130 determines that at least two different types of paired reads are included in the fusion support cluster, it is determined whether there are overlapping fusion regions for the corresponding regions of the different types of paired reads on the genome-wide reference sequence. If the computing device 130 determines that the at least two different types of paired read lengths are not included in the fusion support cluster, then the potential fusion gene is removed at block 710.

In some embodiments, if the computing device 130 determines that at least two types of paired read lengths are included in the fusion support cluster, e.g., there are two types of paired read lengths, class I as indicated at 810 and class II as indicated at 820, it further determines whether the regions on the genome-wide reference sequence corresponding to the different types of paired read lengths share a coincident fusion region. Table one below schematically shows the classes of paired reads 1 through 6 and their corresponding regions on the whole genome reference sequence. As can be seen from Table one, the overall region covered by all paired read lengths of class I is 5 to 70, and the overall region covered by all paired read lengths of class II is 18 to 85. It is then determined whether the two regions cover a common area, which is taken as the fusion region.

At block 708, if the computing device 130 determines that there are overlapping fusion regions for the different types of paired reads at corresponding regions on the whole genome reference sequence, the potential fusion genes are left. If the computing device 130 determines that there are no coincident fusion regions, then the potential fusion gene is removed at block 710.

For example, as shown in Table one, the overall region covered by all paired read lengths of class I is 5 to 70, and the overall region covered by all paired read lengths of class II is 18 to 85. The two regions overlap in the fusion region 18 to 70, so the potential fusion gene is left, i.e., it is identified as a reliable fusion gene whose fusion region is 18 to 70. Conversely, if a reliable support cluster includes only one type of paired read length, the potential fusion gene is discarded. If a reliable support cluster includes two or more types of paired read lengths, it is further determined whether there is an overlapping fusion region, and if so, the potential fusion gene is left. A candidate whose reads have a clear, consistent breakpoint and align continuously on the transcriptome can be considered a reliable gene fusion. Further, the computing device 130 may perform de novo gene structure annotation for consistent gene fusions and the exon subtypes they involve.
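The overlap computation can be sketched as below. The overall spans 5 to 70 for class I and 18 to 85 for class II come from the text; the individual read coordinates are invented for illustration so that they reproduce those spans, and the function names are hypothetical. The sketch also assumes that, when more than two classes are present, the fusion region is the intersection of all class spans.

```python
def class_spans(pairs):
    """Overall covered span per class from (class, start, end) records."""
    spans = {}
    for cls, start, end in pairs:
        low, high = spans.get(cls, (start, end))
        spans[cls] = (min(low, start), max(high, end))
    return spans


def fusion_region(pairs):
    """Return the region shared by all class spans, or None if removed."""
    spans = class_spans(pairs)
    if len(spans) < 2:
        return None  # only one class of paired reads: candidate discarded
    start = max(low for low, _ in spans.values())
    end = min(high for _, high in spans.values())
    return (start, end) if start <= end else None


# Hypothetical coordinates for paired reads 1 to 6.
pairs = [
    ("I", 5, 40), ("I", 20, 70), ("I", 10, 55),
    ("II", 18, 60), ("II", 30, 85), ("II", 25, 75),
]
region = fusion_region(pairs)
```

With these inputs the class I span is 5 to 70, the class II span is 18 to 85, and the shared region is 18 to 70, matching the example in the text.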

By adopting the above means, the present disclosure enables more accurate determination of the reliability of a potential fusion gene. In addition, the fusion region in which the different types of paired reads overlap on the whole genome reference sequence is the region where the fusion gene is actually fused, so the present disclosure also makes it possible to obtain the fusion site and the cause of the fusion by annotating and analyzing the fusion region.

Fig. 9 shows a flow diagram of a method 900 for displaying potential gene fusions according to an embodiment of the present disclosure. It should be appreciated that method 900 may be performed, for example, at electronic device 1300 depicted in fig. 13, and may also be performed at the computing device 130 depicted in fig. 1. It should be understood that method 900 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At block 902, for all misadjusted reads in the fusion support cluster that have the same breakpoint and whose alignment arrows point to the right, the computing device 130 determines the exon numbers covered by the matchable positions, from exon 1 up to the position before the breakpoint, as the exon numbers participating in the fusion at the 5' end. For example, the computing device 130 acquires IGV images of the potential gene fusions that were left, in order to determine the exon numbers participating in the fusion based on the IGV images. Fig. 10 shows a schematic diagram of determining the exon numbers participating in fusion based on acquired IGV images, according to an embodiment of the present disclosure. As shown in FIG. 10, FIG. 10 includes a set of read lengths whose arrows all point to the right (not shown). The exon at the breakpoint position of the incompletely matched part is, for example, Exon N. With the breakpoint as the boundary, the reads are divided into a matchable region (Match region) and an unmatched region (SoftClip region). The matchable region refers to all regions in the set of reads that match the part of the reference they cover; the unmatched region refers to all regions in the set of reads that do not match the part of the reference they cover. Exons are numbered from Exon 1 at the leftmost end of the matchable region to the last exon at the rightmost end (the extreme end). All exon numbers from Exon 1 up to the last matchable position (i.e., Exon N-1) are the exon numbers participating in the fusion at the 5' end mentioned above.

At block 904, for all misadjusted reads in the fusion support cluster that have the same breakpoint and whose alignment arrows point to the left, the computing device 130 determines the exon numbers from the first matched position to the right of the breakpoint through the endmost exon as the exon numbers participating in the fusion at the 3' end. As shown in FIG. 10, the first matched position to the right of the breakpoint is Exon N+1, and the exons from there through the last exon at the end are the exon numbers participating in the fusion at the 3' end.

By adopting the means, the obtained breakpoints, different gene combinations in the gene name combinations and the comparison condition of each paired read length in the reliable support cluster can be comprehensively and visually displayed, so that the guidance effect on the targeted medication can be realized.

In some embodiments, the computing device 130 identifies as reliable gene fusions, for example, those with a clear, consistent breakpoint and with reads aligned continuously on the transcriptome, and annotates the gene structure de novo to obtain the consistent gene fusion and the exon subtypes it involves. The computing device 130 may present the obtained breakpoint, the different gene combinations in the gene name combination, and the alignment of each paired read length in the reliable support cluster, and may perform detailed gene annotation again at the same breakpoint position to find the specific participating exon subtypes; the result is the reliable fusion gene together with the corresponding fusion region. In some embodiments, the computing device 130 may also present information such as breakpoints, subsamples, combinations of genes, and gene function impacts in a summary manner.

Fig. 11 shows a detection result diagram of a verification example according to an embodiment of the present disclosure, namely the detection result for the FFPE sample (119C 5124S1N 1) of the verification example. This sample is known to carry an LMNA-NTRK1 fusion gene, and the results obtained by applying the method of the embodiment to it are shown in FIG. 11. As can be seen directly from fig. 11, the regions found by the method of the embodiment show the fusion of the LMNA gene and the NTRK1 gene, with the LMNA gene on the left and the NTRK1 gene on the right. The exon subtypes of each of the two genes that are specifically involved in the fusion can be found, and the distribution of the paired read lengths supporting the fusion gene and the breakpoint position can also be seen. Table two below shows the information related to the determined fusion of the LMNA gene and the NTRK1 gene.

The RNA gene fusion detection method of the present disclosure takes gene combinations that are annotated with the same gene name combination at both the genome level and the transcriptome level as the corresponding potential fusion genes, takes the pairwise misadjusted reads with the same gene name as the potential supporting pairs of those potential fusion genes, and then finds the reliable fusion genes through reliability determination. This avoids the high false positive rate that arises when RNA fragments are aligned against the whole genome alone, where short fragments are misaligned because the RNA contains no intron regions. It also avoids the missed detections that arise when RNA fragments are aligned directly against transcripts alone, where alternative splicing among multiple transcripts or partial cross-linking of RNA fragments causes non-specific alignment. In other words, the method can both reduce the probability of false positives and improve the detection rate of actual fusion genes.

Fig. 12 shows a flow diagram of a method 1200 for detecting reliable RNA level gene fusion according to an embodiment of the disclosure. It should be understood that the method 1200 may be performed, for example, at the electronic device 1300 depicted in fig. 13, and may also be performed at the computing device 130 depicted in fig. 1. It should be understood that method 1200 may also include additional acts not shown and/or may omit acts shown, as the scope of the disclosure is not limited in this respect.

At block 1202, the computing device 130 may receive genome-wide alignment information and transcriptome-wide alignment information. For example, mRNA can be extracted from a sample, captured by a target panel, and then sequenced on a sequencer (e.g., PE 150x2, with a target sequencing data volume of 20M) to obtain a fastq file. The obtained fastq file is then aligned against the reference genome hg19 (e.g., with bwa mem) to obtain an alignment result file in bam format. In addition, the fastq file is aligned against the human hg19 reference transcriptome (e.g., with minimap2 and gencode transcript.fa) to obtain an alignment result file in paf format.

At block 1204, the computing device 130 clusters the pairwise misadjusted read lengths that meet a predetermined condition to generate a plurality of large clusters. For example, the computing device 130 may filter the candidate chimeric reads out of dedup.bam, the bam-format alignment result file described above, into a subset.bam file according to the insert size between the paired reads (PE reads) (whether it is greater than 2 kb) and the cigar string (e.g., containing [HS] but not [DI]). The computing device 130 then converts the obtained subset.bam file into BED format and merges (merge, -d 10) the records for clustering (cluster, -d 10). For example, read lengths (reads) of PE read pairs that cluster into the same regions are merged together.
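The candidate chimeric read filter can be sketched in plain Python. This is an illustrative reading, not the disclosed tool chain: the text does not state how the insert size and cigar criteria are combined, so the sketch assumes both must hold, and it assumes the clip and indel checks apply across both reads of a pair. The function name and the sample values are hypothetical.

```python
import re


def is_candidate_chimeric(insert_size, cigar1, cigar2, min_insert=2000):
    """Assumed filter: large insert size, clipped, and free of indels.

    insert_size: signed template length of the read pair in bp.
    cigar1, cigar2: CIGAR strings of the two reads of the pair.
    """
    cigars = (cigar1, cigar2)
    has_clip = any(re.search(r"[HS]", cigar) for cigar in cigars)
    has_indel = any(re.search(r"[DI]", cigar) for cigar in cigars)
    return abs(insert_size) > min_insert and has_clip and not has_indel


# (insert size, CIGAR of read 1, CIGAR of read 2), all hypothetical.
pairs = [
    (3500, "90M60S", "150M"),        # large insert, soft-clipped
    (300, "90M60S", "150M"),         # insert size too small
    (3500, "150M", "150M"),          # no clipping
    (3500, "70M2D20M60S", "150M"),   # contains a deletion
]
flags = [is_candidate_chimeric(*pair) for pair in pairs]
```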

At block 1206, the computing device 130 performs genome-wide level gene annotation for the large cluster to generate a first gene name combination for identifying the large cluster based on the corresponding genes. For example, the computing device 130 performs gene annotation for the merged regions, obtains a read ids (reads ids) file corresponding to each gene combination at the genome level, and counts the reads (reads counts). The read ids file may indicate the first gene name combination. Gene annotation is then performed on the support regions of the obtained support.sub.bam files so as to obtain a list of candidate gene combinations.
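One way to picture the genome-level annotation is to look up the gene overlapping each side of a cluster and to collect read ids per gene combination. The sketch below assumes a small in-memory gene table, half-open coordinates, and a `::` separator for the combination name; all of these, and the gene names, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), half-open

# Hypothetical gene coordinates used only for illustration.
GENES: Dict[str, Region] = {
    "GENE_A": ("chr1", 1000, 5000),
    "GENE_B": ("chr2", 20000, 30000),
}


def gene_for_region(region: Region, genes: Dict[str, Region]) -> Optional[str]:
    """Return the first gene overlapping the region, or None."""
    chrom, start, end = region
    for name, (g_chrom, g_start, g_end) in genes.items():
        if chrom == g_chrom and start < g_end and g_start < end:
            return name
    return None


def annotate_clusters(
    clusters: Iterable[Tuple[Region, Region, Set[str]]],
    genes: Dict[str, Region],
) -> Dict[str, Set[str]]:
    """Map each gene combination name to the read ids supporting it."""
    read_ids_by_combo: Dict[str, Set[str]] = defaultdict(set)
    for left, right, read_ids in clusters:
        left_gene = gene_for_region(left, genes)
        right_gene = gene_for_region(right, genes)
        # Skip clusters that are unannotated or fall within a single gene.
        if left_gene is None or right_gene is None or left_gene == right_gene:
            continue
        combo = "::".join(sorted((left_gene, right_gene)))
        read_ids_by_combo[combo].update(read_ids)
    return dict(read_ids_by_combo)
```

The read count for each combination is then simply the size of its read id set.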

At block 1208, the computing device 130 performs a full transcriptome-level gene annotation for the plurality of paired reads to generate a second gene name combination for identifying the plurality of paired reads based on the corresponding genes. For example, the computing device 130 annotates and clusters the transcripts to which the read pairs obtained at block 1206 are aligned, and then annotates the gene names of those transcripts, obtaining a read ids (reads ids) file corresponding to each gene combination at the transcriptome level. This file may indicate the second gene name combination.
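The transcriptome-level annotation can be illustrated in the same spirit: each mate of a read pair is mapped from its transcript to a gene name, and a pair whose mates land in different genes contributes to a gene combination. The input layout `(read_id, mate, transcript_id)`, the transcript-to-gene table, and the `::` separator are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def transcript_level_combinations(
    alignments: Iterable[Tuple[str, int, str]],
    transcript_to_gene: Dict[str, str],
) -> Dict[str, Set[str]]:
    """Map each gene combination to the ids of read pairs spanning it.

    alignments holds (read_id, mate, transcript_id) with mate in {1, 2}.
    """
    genes_by_read: Dict[str, Dict[int, Set[str]]] = defaultdict(
        lambda: {1: set(), 2: set()}
    )
    for read_id, mate, transcript_id in alignments:
        gene = transcript_to_gene.get(transcript_id)
        if gene is not None:
            genes_by_read[read_id][mate].add(gene)

    read_ids_by_combo: Dict[str, Set[str]] = defaultdict(set)
    for read_id, mates in genes_by_read.items():
        for gene_1 in mates[1]:
            for gene_2 in mates[2]:
                if gene_1 != gene_2:  # mates in different genes
                    combo = "::".join(sorted((gene_1, gene_2)))
                    read_ids_by_combo[combo].add(read_id)
    return dict(read_ids_by_combo)
```

Sorting the two gene names makes the combination name independent of which mate hit which gene, which is a design choice of this sketch rather than something the text specifies.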

At block 1210, the computing device 130 determines a potential RNA level gene fusion based on the genes that are consistent between the first gene name combination and the second gene name combination. For example, the computing device 130 generates a potential gene fusion list file by comparing the read ids (reads ids) file corresponding to each gene combination at the genome level obtained at block 1206 with the read ids (reads ids) file corresponding to each gene combination at the transcriptome level obtained at block 1208, and keeping the corresponding genes that are consistent. By employing the above-described means, the present disclosure can obtain effective candidate RNA gene fusion pairs by aligning the RNA sequencing data to both the genome and the transcriptome.
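The comparison of the genome-level and transcriptome-level results can be sketched as a set intersection over gene combination names. Whether the read ids themselves are also intersected is not stated in the text; reporting the shared read ids per combination is an assumption of this sketch, as is the function name.

```python
from typing import Dict, Set


def consistent_combinations(
    genome_level: Dict[str, Set[str]],
    transcriptome_level: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Keep gene combinations found at both levels.

    The value for each kept combination is the set of read ids that
    support it at both levels (possibly empty).
    """
    shared_combos = genome_level.keys() & transcriptome_level.keys()
    return {
        combo: genome_level[combo] & transcriptome_level[combo]
        for combo in shared_combos
    }
```

A combination that appears at only one level, for example because a pseudogene attracts the genomic alignment but not the transcript alignment, drops out of the candidate list at this step.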

At block 1212, if the computing device 130 determines that the potential RNA level gene fusion belongs to a gene family, a pseudogene, or a predetermined transcript alignment non-specific combination, or belongs to adjacent genes for which the support cluster read directions, obtained by aligning each of the two adjacent genes with the reference transcriptome, are opposite, the computing device 130 removes the potential RNA level gene fusion; otherwise, it leaves the potential RNA level gene fusion. For example, according to a configuration (config) file of filtering conditions, the computing device 130 removes a gene combination if the potential RNA level gene fusion determined at block 1210 belongs to the same gene family, a pseudogene, RNA alignment common atti, or a non-quintet transcriptional structure, or belongs to adjacent genes for which the support cluster read directions obtained from the reference transcriptome alignment of each of the two adjacent genes are opposite. The read ids (reads id) corresponding to each gene combination in the retained gene combination list are then extracted into a support.ids file.
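Three of the removal conditions named above (same gene family, pseudogene, adjacent genes with opposite support read directions) can be illustrated with a small predicate. The `Candidate` class and the lookup tables standing in for the configuration file are hypothetical, and the remaining conditions in the text are deliberately not modeled.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class Candidate:
    gene_a: str
    gene_b: str
    direction_a: str  # "+" or "-": support read direction on gene_a
    direction_b: str  # "+" or "-": support read direction on gene_b


def should_remove(
    candidate: Candidate,
    gene_family: Dict[str, str],
    pseudogenes: Set[str],
    adjacent_pairs: Set[FrozenSet[str]],
) -> bool:
    """Return True if the candidate meets any modeled removal condition."""
    family_a = gene_family.get(candidate.gene_a)
    family_b = gene_family.get(candidate.gene_b)
    same_family = family_a is not None and family_a == family_b

    involves_pseudogene = (candidate.gene_a in pseudogenes
                           or candidate.gene_b in pseudogenes)

    is_adjacent = frozenset((candidate.gene_a, candidate.gene_b)) in adjacent_pairs
    adjacent_opposite = is_adjacent and candidate.direction_a != candidate.direction_b

    return same_family or involves_pseudogene or adjacent_opposite
```

Candidates for which `should_remove` returns False would be the ones whose read ids are carried forward.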

At block 1214, the computing device 130 merges the coverage areas of the positions on the full transcriptome reference sequence that correspond to the fusion support clusters and lie within a predetermined distance of each other, to generate merged regions, and filters on the merged regions. For example, the computing device 130 may align (e.g., with minimap2) the reads listed in the support.ids file of each gene combination pair obtained at block 1212 to the transcript.fa of the most common transcript of each single gene (e.g., from the hg19 reference transcriptome) to obtain a sam file, and then merge the region distributions of the reads in the sam file to form merged regions (-d 500). Gene combination pairs whose number of merged regions is 6 or less are then retained.
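The merging step with a distance parameter can be sketched as follows: read coverage intervals on the same reference are merged whenever the gap between them is at most 500, and the gene combination pair is kept when the total number of merged regions does not exceed 6. The function names are hypothetical, and treating "within the distance" as gap less than or equal to the parameter is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (reference name, start, end)


def merge_by_reference(
    intervals: Iterable[Interval], max_gap: int = 500
) -> Dict[str, List[Tuple[int, int]]]:
    """Merge intervals on each reference whose gap is <= max_gap."""
    by_ref: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for ref, start, end in intervals:
        by_ref[ref].append((start, end))

    merged: Dict[str, List[Tuple[int, int]]] = {}
    for ref, spans in by_ref.items():
        out: List[Tuple[int, int]] = []
        for start, end in sorted(spans):
            if out and start - out[-1][1] <= max_gap:
                # Close enough to the previous region: extend it.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[ref] = out
    return merged


def passes_region_count(
    merged: Dict[str, List[Tuple[int, int]]], max_regions: int = 6
) -> bool:
    """True if the total number of merged regions is at most max_regions."""
    return sum(len(spans) for spans in merged.values()) <= max_regions
```

A genuine fusion tends to concentrate its support reads in a few regions around the junction, so a small number of merged regions is the favorable case in this sketch.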

At block 1216, the computing device 130 leaves the potential fusion gene if it determines that paired reads of different types have overlapping fusion regions in the corresponding regions on the whole genome reference sequence. For example, for the transcript alignment sam file of each gene combination pair retained at block 1214, the diversity of the alignment start positions of the reads in each merged region is calculated, the number of merged regions with a diversity of 2 or more is counted, and gene combination pairs for which none of the merged regions satisfies this condition are filtered out.
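The start-position diversity check can be illustrated by counting distinct alignment start positions inside each merged region. Interpreting "diversity" as the number of distinct start positions, and keeping a pair when at least one region reaches a diversity of 2, are assumptions of this sketch; the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def diverse_region_count(
    regions: Iterable[Tuple[int, int]],
    read_starts: Iterable[int],
    min_diversity: int = 2,
) -> int:
    """Count regions holding at least min_diversity distinct read starts."""
    starts: List[int] = list(read_starts)
    count = 0
    for region_start, region_end in regions:
        distinct = {s for s in starts if region_start <= s < region_end}
        if len(distinct) >= min_diversity:
            count += 1
    return count


def keep_gene_pair(
    regions: Iterable[Tuple[int, int]], read_starts: Iterable[int]
) -> bool:
    """Assumption: keep the pair if any region shows start diversity >= 2."""
    return diverse_region_count(regions, read_starts) > 0
```

Reads that all start at exactly the same position are more likely to be amplification duplicates than independent support, which is the intuition behind requiring more than one distinct start.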

At block 1218, the computing device 130 acquires igv images of the retained potential gene fusions in order to determine, based on the igv images, the exon numbers involved in the fusion and thereby interpret the subtype of the fusion gene. For example, the computing device 130 takes an igv snapshot based on the bam file, the corresponding transcript exon structure information, and the interval information of the filtered gene combination pairs obtained at block 1216. Then, based on the characteristics of the reads in the snapshot, the truly fused gene combination pairs are finally determined according to the gene fusion pattern. By adopting the above means, the present disclosure can confirm gene fusion intuitively and effectively from snapshots, according to the support reads pattern of positive RNA fusions.
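One possible way to automate the snapshots is to generate an igv batch script, which igv can execute to load a bam file, jump to each region, and save an image. The sketch below only builds the script text using the standard batch commands `new`, `genome`, `load`, `snapshotDirectory`, `goto`, `snapshot`, and `exit`; the function name, file names, and loci are hypothetical, and the disclosure does not say that a batch script is used.

```python
from typing import Iterable, List, Tuple


def igv_batch_script(
    bam_path: str,
    snapshot_dir: str,
    regions: Iterable[Tuple[str, str]],
    genome: str = "hg19",
) -> str:
    """Build igv batch script text.

    regions holds (snapshot name, locus) pairs, with the locus written
    as chromosome:start-end.
    """
    lines: List[str] = [
        "new",
        f"genome {genome}",
        f"load {bam_path}",
        f"snapshotDirectory {snapshot_dir}",
    ]
    for name, locus in regions:
        lines.append(f"goto {locus}")
        lines.append(f"snapshot {name}.png")
    lines.append("exit")
    return "\n".join(lines) + "\n"


# Hypothetical usage: one snapshot per side of a candidate fusion.
script = igv_batch_script(
    "sample.bam",
    "snapshots",
    [("GENE_A_side", "chr1:1000-5000"), ("GENE_B_side", "chr2:20000-30000")],
)
```

The resulting text would be saved to a file and passed to igv as a batch file; the images are then reviewed against the expected support reads pattern.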

FIG. 13 schematically illustrates a block diagram of an electronic device 1300 suitable for implementing embodiments of the present disclosure. The device 1300 may be a device for implementing the methods 200, 700, 900 and 1200 shown in fig. 2-6. As shown in fig. 13, the device 1300 includes a Central Processing Unit (CPU) 1301 that may perform various appropriate actions and processes according to computer program instructions stored in a Read Only Memory (ROM) 1302 or computer program instructions loaded from a storage unit 1308 into a Random Access Memory (RAM) 1303. The RAM 1303 can also store various programs and data necessary for the operation of the device 1300. The CPU 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304. An input/output (I/O) interface 1305 is also connected to the bus 1304.

A number of components in the device 1300 are connected to the I/O interface 1305, including an input unit 1306, an output unit 1307, a storage unit 1308, and a communication unit 1309. The processing unit 1301 performs the various methods and processes described above, for example, methods 200, 500, 700, 900 and 1200. For example, in some embodiments, the methods 200, 700, 900 and 1200 may be implemented as a computer software program stored on a machine-readable medium, such as the storage unit 1308. In some embodiments, some or all of the computer program may be loaded onto and/or installed onto the device 1300 via the ROM 1302 and/or the communication unit 1309. When the computer program is loaded into the RAM 1303 and executed by the CPU 1301, one or more of the operations of methods 200, 700, 900 and 1200 described above may be performed. Alternatively, in other embodiments, the CPU 1301 may be configured in any other suitable manner (e.g., by way of firmware) to perform one or more of the acts of the methods 200, 500, 700, 900 and 1200.

It should be further appreciated that the present disclosure may be embodied as methods, apparatus, systems, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for carrying out various aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, the electronic circuitry that can execute the computer-readable program instructions implements aspects of the present disclosure by utilizing the state information of the computer-readable program instructions to personalize the electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA).

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

The above are only alternative embodiments of the present disclosure and are not intended to limit the present disclosure, which may be modified and varied by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims

1. A method for detecting RNA level gene fusion, comprising:

receiving whole genome comparison information and whole transcriptome comparison information, wherein the whole genome comparison information and the whole transcriptome comparison information are respectively generated based on comparison results of double-end sequencing data and whole genome reference sequences and whole transcriptome reference sequences, and the double-end sequencing data comprises a plurality of paired reading lengths of a sample to be detected;

clustering pairwise disregulated reads to generate a plurality of large clusters, the pairwise disregulated reads being obtained based on whole genome alignment information;

performing genome-wide level gene annotation for the large cluster meeting a first predetermined condition to generate a first gene name combination for identifying the large cluster based on a corresponding gene;

performing a transcriptome-level gene annotation for the plurality of paired reads based on the transcriptome-wide alignment information to generate a second gene name combination identifying the plurality of paired reads based on the corresponding gene; and

identifying identical corresponding genes with which the first gene name combination and the second gene name combination are associated to determine a potential fusion gene based on the identical corresponding genes.

2. The method of claim 1, further comprising:

merging large clusters identified by the same gene combination name to generate supporting clusters associated with the same gene combination name, the supporting clusters being pairwise disregulated read lengths under large clusters identified by the same gene combination name; and

determining potential supporting clusters for the potential fusion genes based on the supporting clusters having the same gene name.

3. The method of claim 1, wherein generating a plurality of large clusters comprises:

in response to determining that a spacing between mapping positions corresponding to the pair-wise misadjusted read lengths satisfies a predetermined clustering distance, clustering for the pair-wise misadjusted read lengths to generate a plurality of small clusters; and

in response to determining that the spacing of the small clusters satisfies a predetermined distance, merging the small clusters to generate the large cluster.

4. The method of claim 1, wherein the first predetermined condition comprises:

the number of pairwise offset read lengths included in the large cluster is greater than or equal to a first predetermined number of pairs.

5. The method of claim 2, further comprising:

in response to determining that the number of pairwise disregulated read lengths included in the potential supporting cluster is greater than or equal to a second predetermined number of pairs, determining whether the potential fusion gene satisfies the following removal condition:

at least one of a gene family, a pseudogene, a predetermined transcript alignment non-specific combination, a non-coding ribonucleic acid;

support cluster read length directions which belong to adjacent genes and are obtained by comparing two genes in the adjacent genes with a reference transcriptome are opposite;

does not belong to a predetermined transcriptional structure;

in response to determining that the potential fusion gene does not satisfy the removal condition, leaving the potential fusion gene; and

determining that the potential fusion gene left is a reliable potential fusion gene.

6. The method of claim 5, wherein leaving the potential fusion gene behind comprises:

in response to determining that the potential fusion gene does not satisfy the removal condition, separately aligning the pair-wise deregulated reads corresponding to the corresponding gene with a transcribed reference sequence of a predetermined transcript of the corresponding gene to determine an alignment quality;

determining whether the alignment quality is greater than or equal to a predetermined alignment quality;

in response to determining that the alignment quality is greater than or equal to a predetermined alignment quality, determining pairwise de-aligned read lengths in the potential support cluster that satisfy a predetermined filtering condition; and

leaving the potential fusion gene in response to determining that the number of pairwise dysregulated read lengths that satisfy the predetermined filtering condition is greater than or equal to a third predetermined number of pairs.

7. The method of claim 6, wherein leaving the potential fusion gene further comprises:

determining fusion support clusters corresponding to potential fusion genes with the alignment quality greater than or equal to a predetermined alignment quality;

merging coverage areas of positions, corresponding to the fusion support clusters, on the full transcriptome reference sequence according to adjacent preset distance ranges so as to generate merged areas;

determining whether the number of merge regions is less than or equal to a predetermined number; and

leaving the potential fusion genes in response to determining that the number of the merged regions is less than or equal to a predetermined number.

8. The method of claim 7, wherein leaving the potential fusion genes in response to determining that the number of merged regions is less than or equal to a predetermined number comprises:

in response to determining that the number of merged regions is less than or equal to a predetermined number, determining whether the corresponding fusion support cluster satisfies at least one of the following conditions:

the break points existing in the fusion supporting cluster meet a preset consistency condition;

the alignment of the fusion supporting clusters on the transcript is a continuous alignment; and

leaving the potential fusion gene in response to determining that the fusion supporting cluster satisfies at least one of the above conditions.

9. The method of claim 8, wherein the predetermined consistency condition comprises:

among all the misaligned reads in the fusion support cluster that have the same breakpoint, the sequences that cannot be continuously aligned, taking the same breakpoint as a starting point, are the same, and the number of all the misaligned reads with the same breakpoint is greater than or equal to 2.

10. The method of claim 8, further comprising:

aligning the paired reads in the corresponding fusion support clusters in a genome-wide reference sequence to determine the category of the paired reads based on whether breakpoints exist at both ends;

determining whether at least two different types of the pair of read lengths are included in the fusion supporting cluster; and

leaving the potential fusion gene in response to determining that there are coincident fusion regions of the pairs of reads of different types in corresponding regions on the whole genome reference sequence.

11. The method of claim 10, further comprising:

acquiring igv images of the potential fusion genes left behind in order to determine, based on the igv images, the exon numbers involved in fusion so as to interpret the subtype of the fusion gene.

12. The method of claim 11, wherein determining, based on the igv images, the exon numbers involved in fusion so as to interpret the subtype of the fusion gene comprises:

for all the misaligned reads in the fusion support cluster that have the same breakpoint and whose alignment arrow direction faces to the right, determining the exon numbers from exon number 1 to the matched position immediately before the breakpoint position as the exon numbers participating in fusion at the 5' end; and

for all the misaligned reads in the fusion support cluster that have the same breakpoint and whose alignment arrow direction faces to the left, determining the exon numbers from the first position that can be matched to the right of the breakpoint position through the last exon number as the exon numbers participating in fusion at the 3' end.

13. A computing device, comprising:

at least one processing unit;

at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions when executed by the at least one processing unit causing the computing device to perform the method of any of claims 1-12.

14. A computer-readable storage medium, having stored thereon a computer program which, when executed by a machine, implements the method of any of claims 1-12.