CN116312797B - Method, apparatus and medium for predicting functional fusion for gene structural rearrangement - Google Patents

Publication number
CN116312797B
Authority
CN
China
Prior art keywords
breakpoint
gene
kinase domain
codon
state information
Prior art date
Legal status
Active
Application number
CN202310136782.0A
Other languages
Chinese (zh)
Other versions
CN116312797A (en)
Inventor
陈惠�
王凯
庞菲
Current Assignee
Shanghai Zhiben Medical Laboratory Co ltd
Origimed Technology Shanghai Co ltd
Original Assignee
Shanghai Zhiben Medical Laboratory Co ltd
Origimed Technology Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Zhiben Medical Laboratory Co ltd, Origimed Technology Shanghai Co ltd
Priority to CN202310136782.0A
Publication of CN116312797A
Application granted
Publication of CN116312797B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The present invention relates to a method, apparatus, and medium for predicting functional fusion for gene structural rearrangements. The method comprises: calculating a breakpoint of a gene structural rearrangement based on sequencing data; obtaining the edge remaining codons of the transcript corresponding to the breakpoint of the gene structural rearrangement, so as to calculate codon triplet state information at the breakpoint reconnection; determining whether a gene involved in the gene structural rearrangement comprises a kinase domain; in response to determining that the gene involved in the gene structural rearrangement comprises a kinase domain, obtaining the genomic range corresponding to the kinase domain, so as to determine inclusion state information indicating whether the new chimeric transcript formed by the gene structural rearrangement includes the kinase domain; and predicting a functional fusion state for the gene structural rearrangement based on the inclusion state information and/or the codon triplet state information at the breakpoint reconnection. The invention enables accurate prediction of functional fusion for gene structural rearrangements detected by NGS.

Description

Method, apparatus and medium for predicting functional fusion for gene structural rearrangement
Technical Field
The present invention relates generally to the processing of biological information, and in particular to a method, computing device, and computer storage medium for predicting functional fusion for gene structural rearrangement.
Background
Current next-generation sequencing (NGS) assays can detect gene structural rearrangements. In addition, a number of small-molecule inhibitor drugs targeting functional fusions have been approved by the (C)FDA; for example, but not limited to, inhibitors targeting ALK, RET, and ROS1 fusions are approved for the treatment of non-small cell lung cancer. Accurately predicting functional fusion from gene structural rearrangements detected by NGS is therefore particularly important for the selection and efficacy of targeted drug therapy.
No currently disclosed algorithm can accurately predict functional fusion for gene structural rearrangements detected by NGS.
Disclosure of Invention
The invention provides a method, a computing device, and a computer storage medium for predicting functional fusion for gene structural rearrangement, which can accurately predict functional fusion for gene structural rearrangements detected by NGS.
According to a first aspect of the present invention, there is provided a method for predicting functional fusion for gene structural rearrangements. The method comprises: calculating a breakpoint of a gene structural rearrangement based on sequencing data; obtaining the edge remaining codons of the transcript corresponding to the breakpoint of the gene structural rearrangement, so as to calculate codon triplet state information at the breakpoint reconnection; determining whether a gene involved in the gene structural rearrangement comprises a kinase domain; in response to determining that the gene involved in the gene structural rearrangement comprises a kinase domain, obtaining the genomic range corresponding to the kinase domain, so as to determine inclusion state information indicating whether the new chimeric transcript formed by the gene structural rearrangement includes the kinase domain; and predicting a functional fusion state for the gene structural rearrangement based on the inclusion state information and/or the codon triplet state information at the breakpoint reconnection.
According to a second aspect of the present invention, there is also provided a computing device, the device comprising: a memory configured to store one or more computer programs; and a processor coupled to the memory and configured to execute the one or more computer programs to cause the device to perform the method of the first aspect of the invention.
According to a third aspect of the present invention, there is also provided a non-transitory computer-readable storage medium. The non-transitory computer readable storage medium has stored thereon machine executable instructions that, when executed, cause a machine to perform the method of the first aspect of the invention.
In some embodiments, predicting a functional fusion state for a gene structural rearrangement comprises: determining whether a predetermined condition is satisfied, the predetermined condition including at least one of: the codon triplet state information at the breakpoint reconnection indicates a non-frameshift; the inclusion state information indicates that the new chimeric transcript comprises a kinase domain; or the inclusion state information indicates that the new chimeric transcript partially comprises a kinase domain; and, if it is determined that the predetermined condition is satisfied, determining that the gene structural rearrangement is a functional fusion.
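By way of illustration only, the predetermined condition above can be sketched as a simple disjunction. The function name `predict_functional_fusion`, the string labels for the inclusion state, and the reading of "at least one of" as a logical OR over the three listed conditions are assumptions made for this sketch, not details confirmed by the embodiment.

```python
from typing import Optional

# Hypothetical labels for the inclusion state information (assumed names).
FULL = "full"        # new chimeric transcript comprises the kinase domain
PARTIAL = "partial"  # new chimeric transcript partially comprises the kinase domain
NONE = "none"        # new chimeric transcript does not comprise the kinase domain


def predict_functional_fusion(non_frameshift: Optional[bool],
                              inclusion: Optional[str]) -> bool:
    """Return True if at least one of the listed conditions holds.

    `non_frameshift` is None when no codon triplet state was computed;
    `inclusion` is None when the gene has no kinase domain.
    """
    conditions = [
        non_frameshift is True,   # codon triplet state indicates a non-frameshift
        inclusion == FULL,        # kinase domain fully retained
        inclusion == PARTIAL,     # kinase domain partially retained
    ]
    return any(conditions)
```

An actual embodiment may weight or combine these conditions differently; the sketch only shows the "at least one of" form stated above.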
In some embodiments, determining the inclusion state information indicating whether the new chimeric transcript formed by the gene structural rearrangement includes the kinase domain comprises: calculating the region of the breakpoint of the gene structural rearrangement relative to the kinase domain, so as to determine the inclusion state information, the inclusion state information comprising one of: the new chimeric transcript comprises the kinase domain; the new chimeric transcript partially comprises the kinase domain; or the new chimeric transcript does not comprise the kinase domain.
In some embodiments, calculating the region of the breakpoint of the gene structural rearrangement relative to the kinase domain comprises: obtaining the Pdot position regions corresponding to all domains from the NCBI protein conserved domain database; obtaining, based on an identifier indicating the kinase domain, the Pdot position region corresponding to the kinase domain; converting the obtained Pdot position region corresponding to the kinase domain into a genomic position region; and calculating the breakpoint position of the gene structural rearrangement relative to the converted genomic position region, so as to determine the region of the breakpoint of the gene structural rearrangement relative to the kinase domain.
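The conversion of a Pdot (protein-level) position region into a genomic position region can be sketched as follows. This is a minimal sketch under stated assumptions, not the conversion procedure of the embodiment: the function names, the representation of the coding sequence as a list of 0-based half-open genomic blocks in ascending order, and the toy coordinates are all hypothetical. It relies only on the fact that amino acid number p occupies coding nucleotides 3(p-1) to 3p-1 (0-based).

```python
from typing import List, Tuple

Block = Tuple[int, int]  # 0-based, half-open genomic interval of coding sequence


def cds_offset_to_genome(offset: int, cds_blocks: List[Block], strand: str) -> int:
    """Map a 0-based offset within the coding sequence to a genomic coordinate."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Transcription order: ascending blocks on '+', descending blocks on '-'.
    blocks = cds_blocks if strand == "+" else list(reversed(cds_blocks))
    for start, end in blocks:
        length = end - start
        if offset < length:
            return start + offset if strand == "+" else end - 1 - offset
        offset -= length
    raise ValueError("offset lies beyond the end of the coding sequence")


def protein_range_to_genome(aa_start: int, aa_end: int,
                            cds_blocks: List[Block], strand: str) -> Tuple[int, int]:
    """Convert a 1-based inclusive amino acid range to a genomic span.

    Returns (lowest, highest) 0-based genomic coordinates, both inclusive.
    """
    first = cds_offset_to_genome(3 * (aa_start - 1), cds_blocks, strand)
    last = cds_offset_to_genome(3 * aa_end - 1, cds_blocks, strand)
    return (min(first, last), max(first, last))


# Toy transcript (hypothetical coordinates): two coding blocks, 15 nt = 5 codons.
toy_blocks = [(100, 106), (200, 209)]
plus_span = protein_range_to_genome(2, 3, toy_blocks, "+")
minus_span = protein_range_to_genome(2, 3, toy_blocks, "-")
```

Returning the span with the lower coordinate first keeps the start position below the termination position in genomic terms regardless of strand, which is the convention a position comparison against the breakpoint would need.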
In some embodiments, calculating the codon triplet state information at the breakpoint reconnection comprises: in the case of negative-strand transcription, reversing the order of the obtained remaining codon states of the exons, the remaining codon states of the exons being the information in column 16 of the reference genome information file of the human reference genome HG19 from the UCSC Genome Browser database; removing the first codon state equal to 0 from the reverse-ordered remaining codon states of the exons; if the breakpoint is expressed in direct gene structural form and the gene is negative-strand transcribed and serves as the 3' end, obtaining the complementary state information of the previous codon at the breakpoint reconnection, for calculating the codon triplet state information at the breakpoint reconnection; and if the breakpoint is expressed in direct gene structural form and the gene is negative-strand transcribed and serves as the 5' end, obtaining the state information of the corresponding codon at the breakpoint reconnection, for calculating the codon triplet state information at the breakpoint reconnection.
In some embodiments, calculating the codon triplet state information at the breakpoint reconnection comprises: in the case of positive-strand transcription, removing the first codon state equal to 0 from the obtained remaining codon states of the exons; if the breakpoint is expressed in direct gene structural form and the gene is positive-strand transcribed and serves as the 3' end, obtaining the complementary state information of the previous codon at the breakpoint reconnection, for calculating the codon triplet state information at the breakpoint reconnection; and if the breakpoint is expressed in direct gene structural form and the gene is positive-strand transcribed and serves as the 5' end, obtaining the state information of the corresponding codon at the breakpoint reconnection, for calculating the codon triplet state information at the breakpoint reconnection.
In some embodiments, after removing the first codon state equal to 0 from the obtained remaining codon states of the exons, the method for predicting functional fusion for gene structural rearrangement further comprises: if the breakpoint is expressed in genomic position form and the gene is positive-strand transcribed and serves as the 3' end, obtaining the complementary state information at the breakpoint reconnection, for calculating the codon triplet state information at the breakpoint reconnection; and if the breakpoint is expressed in genomic position form and the gene is positive-strand transcribed and serves as the 5' end, obtaining the state information of the corresponding codon at the breakpoint reconnection, for calculating the codon triplet state information at the breakpoint reconnection.
In some embodiments, determining the inclusion state information indicating whether the new chimeric transcript formed by the gene structural rearrangement includes the kinase domain comprises: determining that the new chimeric transcript partially comprises the kinase domain if the breakpoint position of the gene structural rearrangement lies between the start position and the termination position of the kinase domain; determining that the new chimeric transcript does not comprise the kinase domain if either of the following is satisfied: the gene is positive-strand transcribed and serves as the 5' end, and the breakpoint position is less than the start position of the kinase domain; or the gene is positive-strand transcribed and serves as the 3' end, and the breakpoint position is greater than the termination position of the kinase domain; and determining that the new chimeric transcript comprises the kinase domain if either of the following is satisfied: the gene is positive-strand transcribed and serves as the 5' end, and the breakpoint position is greater than the termination position of the kinase domain; or the gene is positive-strand transcribed and serves as the 3' end, and the breakpoint position is less than the start position of the kinase domain.
In some embodiments, determining the inclusion state information indicating whether the new chimeric transcript formed by the gene structural rearrangement includes the kinase domain comprises: determining that the new chimeric transcript does not comprise the kinase domain if either of the following is satisfied: the gene is negative-strand transcribed and serves as the 5' end, and the breakpoint position is greater than the termination position of the kinase domain; or the gene is negative-strand transcribed and serves as the 3' end, and the breakpoint position is less than the start position of the kinase domain; and determining that the new chimeric transcript comprises the kinase domain if either of the following is satisfied: the gene is negative-strand transcribed and serves as the 5' end, and the breakpoint position is less than the start position of the kinase domain; or the gene is negative-strand transcribed and serves as the 3' end, and the breakpoint position is greater than the termination position of the kinase domain.
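The positive-strand and negative-strand rules above can be sketched as a single position comparison. The names `Inclusion` and `kinase_inclusion` are hypothetical. The sketch also assumes that the start position is numerically smaller than the termination position in genomic coordinates, that a breakpoint exactly on either boundary counts as "between", and that the "between" rule applies to both strands; none of these is stated explicitly above.

```python
from enum import Enum


class Inclusion(Enum):
    FULL = "comprises the kinase domain"
    PARTIAL = "partially comprises the kinase domain"
    NONE = "does not comprise the kinase domain"


def kinase_inclusion(strand: str, end: str, breakpoint: int,
                     domain_start: int, domain_end: int) -> Inclusion:
    """Classify kinase domain inclusion from strand, fusion end, and positions."""
    if strand not in ("+", "-") or end not in ("5'", "3'"):
        raise ValueError("strand must be '+'/'-' and end must be \"5'\"/\"3'\"")
    # Breakpoint inside the domain: only part of the domain is retained.
    if domain_start <= breakpoint <= domain_end:
        return Inclusion.PARTIAL
    # The retained segment lies at lower genomic coordinates for a
    # positive-strand 5' partner and for a negative-strand 3' partner;
    # otherwise it lies at higher coordinates.
    keeps_lower_side = (strand == "+") == (end == "5'")
    if keeps_lower_side:
        return Inclusion.FULL if breakpoint > domain_end else Inclusion.NONE
    return Inclusion.FULL if breakpoint < domain_start else Inclusion.NONE
```

Folding the eight strand and end combinations into one "which side is retained" test reproduces each rule listed above while avoiding eight separate branches.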
In some embodiments, the method for predicting functional fusion for gene structural rearrangement further comprises: acquiring gene fusion information in a plurality of predetermined databases so as to determine a predetermined gene fusion set; determining whether the calculated gene structural rearrangement belongs to a predetermined gene fusion set; and determining that the gene structural rearrangement is a functional fusion in response to determining that the gene involved in the gene structural rearrangement belongs to the predetermined set of gene fusions.
In some embodiments, calculating the codon triplet state information at the breakpoint reconnection comprises: in response to determining that the gene involved in the gene structural rearrangement does not belong to the predetermined gene fusion set, obtaining the edge remaining codons of the transcript corresponding to the breakpoint of the gene structural rearrangement, so as to calculate the codon triplet state information at the breakpoint reconnection.
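The order of these two checks can be sketched as a lookup followed by a fallback. The function name `predict_with_known_set`, the choice to key the predetermined gene fusion set by a (5' gene, 3' gene) pair, and the placeholder gene names are assumptions made for illustration; the text above does not specify how the set is keyed.

```python
from typing import Callable, FrozenSet, Tuple

FusionPair = Tuple[str, str]  # (5' gene, 3' gene); an assumed key format


def predict_with_known_set(pair: FusionPair,
                           known_fusions: FrozenSet[FusionPair],
                           fallback: Callable[[FusionPair], bool]) -> bool:
    """Return True for a known fusion; otherwise defer to the fallback analysis."""
    if pair in known_fusions:
        return True  # belongs to the predetermined gene fusion set
    # Not in the set: continue with the codon triplet state calculation
    # (represented here by a caller-supplied function).
    return fallback(pair)


# Usage with placeholder names and a stub fallback that records its calls.
known = frozenset({("GENE_A", "GENE_B")})
fallback_calls = []


def stub_fallback(pair: FusionPair) -> bool:
    fallback_calls.append(pair)
    return False


hit = predict_with_known_set(("GENE_A", "GENE_B"), known, stub_fallback)
miss = predict_with_known_set(("GENE_C", "GENE_D"), known, stub_fallback)
```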
The summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the invention, nor is it intended to be used to limit the scope of the invention.
Drawings
FIG. 1 shows a schematic diagram of a system for implementing a method for predicting functional fusion for gene structural rearrangement according to an embodiment of the invention.
FIG. 2 shows a flow chart of a method for predicting functional fusion for genetic structural rearrangement in accordance with an embodiment of the present invention.
FIG. 3 shows a flow chart of a method for determining that a gene structural rearrangement is a functional fusion according to an embodiment of the present invention.
FIG. 4 shows a flow chart of a method for calculating codon triplet state information at the breakpoint reconnection according to an embodiment of the invention.
FIG. 5 shows a flowchart of a method for calculating a breakpoint of a gene structural rearrangement relative to a region of a kinase domain, according to an embodiment of the present invention.
Fig. 6 schematically shows a block diagram of an electronic device suitable for implementing embodiments of the invention.
Like or corresponding reference characters indicate like or corresponding parts throughout the several views.
Detailed Description
Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are illustrated in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The term "comprising" and variations thereof as used herein are open ended, i.e., "including but not limited to". The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same objects.
As described above, no currently disclosed algorithm can accurately predict functional fusion for gene structural rearrangements detected by NGS.
To at least partially address one or more of the above problems, as well as other potential problems, example embodiments of the present invention propose a scheme for predicting functional fusion for gene structural rearrangements. In this scheme, a breakpoint of the gene structural rearrangement is calculated based on sequencing data; the edge remaining codons of the transcript corresponding to that breakpoint are obtained so as to calculate the codon triplet state information at the breakpoint reconnection; it is determined whether a gene involved in the gene structural rearrangement comprises a kinase domain; if so, the genomic range corresponding to the kinase domain is obtained so as to determine inclusion state information indicating whether the new chimeric transcript formed by the gene structural rearrangement includes the kinase domain; and a functional fusion state is predicted for the gene structural rearrangement based on the inclusion state information and/or the codon triplet state information at the breakpoint reconnection. The invention can thereby accurately predict functional fusion for gene structural rearrangements detected by NGS.
FIG. 1 shows a schematic diagram of a system 100 for predicting functional fusion for gene structural rearrangement in accordance with an embodiment of the present invention. As shown in FIG. 1, the system 100 includes: a computing device 110, a sequencing device 130, a network 140, and a server 150. In some embodiments, the computing device 110, the sequencing device 130, and the server 150 exchange data via the network 140.
The sequencing device 130 is used, for example, to perform NGS sequencing on a plurality of samples to be tested from a plurality of target objects so as to generate sequencing data, and to send the generated sequencing data to the computing device 110.
The server 150 is used, for example, to provide information from a plurality of predetermined databases. The predetermined databases include, for example, the NCBI and UCSC (Genome Browser) databases. Computing device 110 may, for example, obtain information from a predetermined database provided by server 150. For example, computing device 110 may obtain, via server 150, the annotation file refGene.txt corresponding to the human reference genome hg19.
The computing device 110 is used, for example, to predict functional fusion for gene structural rearrangement. Specifically, the computing device 110 calculates a breakpoint of the gene structural rearrangement based on the sequencing data, and obtains the edge remaining codons of the transcript corresponding to the breakpoint of the gene structural rearrangement so as to calculate the codon triplet state information at the breakpoint reconnection. Computing device 110 also determines whether a gene involved in the gene structural rearrangement comprises a kinase domain; in response to determining that the gene comprises a kinase domain, obtains the genomic range corresponding to the kinase domain so as to determine inclusion state information indicating whether the new chimeric transcript formed by the gene structural rearrangement includes the kinase domain; and predicts the functional fusion state for the gene structural rearrangement based on the inclusion state information and/or the codon triplet state information at the breakpoint reconnection.
In some embodiments, computing device 110 may have one or more processing units, including special-purpose processing units such as GPUs, FPGAs, and ASICs, as well as general-purpose processing units such as CPUs. In addition, one or more virtual machines may run on each computing device. The computing device 110 includes, for example: a breakpoint calculation unit 112 for gene structural rearrangement, a codon triplet state information calculation unit 114, a kinase domain determination unit 116, an inclusion state information determination unit 118, and a functional fusion state prediction unit 120. These units may be configured on one or more computing devices 110.
The breakpoint calculation unit 112 for gene structural rearrangement is used to calculate a breakpoint of the gene structural rearrangement based on sequencing data.
The codon triplet state information calculation unit 114 is used to obtain the edge remaining codons of the transcript corresponding to the breakpoint of the gene structural rearrangement, so as to calculate the codon triplet state information at the breakpoint reconnection.
The kinase domain determination unit 116 is used to determine whether a gene involved in the gene structural rearrangement comprises a kinase domain.
The inclusion state information determination unit 118 is used to obtain, if it is determined that the gene involved in the gene structural rearrangement comprises a kinase domain, the genomic range corresponding to the kinase domain, so as to determine inclusion state information indicating whether the new chimeric transcript formed by the gene structural rearrangement includes the kinase domain.
The functional fusion state prediction unit 120 is used to predict the functional fusion state for the gene structural rearrangement based on the inclusion state information and/or the codon triplet state information at the breakpoint reconnection.
A method for predicting functional fusion for gene structural rearrangement according to an embodiment of the present invention will be described below with reference to FIG. 2. FIG. 2 shows a flow chart of a method 200 for predicting functional fusion for gene structural rearrangement in accordance with an embodiment of the present invention. It should be appreciated that the method 200 may be performed, for example, at the electronic device 600 depicted in FIG. 6, and may also be performed at the computing device 110 depicted in FIG. 1. It should also be appreciated that method 200 may include additional actions not shown and/or may omit actions shown; the scope of the invention is not limited in this respect.
At step 202, the computing device 110 calculates a breakpoint of the genetic structural rearrangement based on the sequencing data.
The sequencing data are, for example, sequencing data obtained by NGS sequencing.
The method of calculating the breakpoint of the gene structural rearrangement may be any method that obtains the breakpoint from NGS sequencing data. In some embodiments, the method comprises, for example: first performing paired-end sequencing on each sequenced fragment captured from the sample to be tested by probes, to obtain paired-end sequencing data comprising a plurality of read pairs of the sample to be tested. The computing device 110 then generates whole-genome alignment information by aligning the paired-end sequencing data of the sample to be tested against the whole-genome reference sequence. The whole-genome alignment information includes, for example: the mapping direction and mapping position of each read on the whole genome, the insert size obtained from the alignment of each read pair, and the gene structural rearrangements and their breakpoints identified from how the reads align. It should be understood that other methods for calculating breakpoints of gene structural rearrangements may also be used in the present invention.
At step 204, computing device 110 obtains the edge remaining codons of the transcript corresponding to the breakpoint of the gene structural rearrangement, so as to calculate the codon triplet state information at the breakpoint reconnection.
In some embodiments, prior to step 204, computing device 110 obtains gene fusion information from a plurality of predetermined databases to determine a predetermined gene fusion set; determines whether the calculated gene structural rearrangement belongs to the predetermined gene fusion set; and, in response to determining that the gene involved in the gene structural rearrangement belongs to the predetermined gene fusion set, determines that the gene structural rearrangement is a functional fusion. If the computing device 110 determines that the gene involved in the gene structural rearrangement does not belong to the predetermined gene fusion set, it obtains the edge remaining codons of the transcript corresponding to the breakpoint of the gene structural rearrangement so as to calculate the codon triplet state information at the breakpoint reconnection.
In some embodiments, computing device 110 may determine whether the calculated codon triplet state information at the breakpoint reconnection indicates a frameshift or a non-frameshift. For example, the computing device 110 may obtain, via the server 150, the reference genome information file refGene.txt of the human reference genome HG19 from UCSC. The computing device 110 obtains the information in column 16 of refGene.txt (i.e., exonFrames), which indicates the remaining codon state of each exon, that is, how the different exons combine to form amino acids. In addition, computing device 110 obtains the information in columns 2 and 4 of refGene.txt, where column 2 gives the corresponding gene transcript number (the name column) and column 4 gives the direction of gene transcription (the strand column). The computing device 110 determines the remaining codon state of gene A at the breakpoint reconnection and the remaining codon state of gene B at the breakpoint reconnection in the rearranged gene structure, so that it can calculate whether the codon triplet state information at the breakpoint reconnection indicates a frameshift.
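As an illustration, the three columns mentioned above can be read from one tab-separated line of refGene.txt as follows. The names `RefGeneRecord` and `parse_refgene_line` are hypothetical, and the toy line fills every column other than 2, 4, and 16 with the placeholder "NA" rather than real annotation values; the ETV6 values are those given in this description.

```python
from typing import List, NamedTuple


class RefGeneRecord(NamedTuple):
    transcript: str         # column 2 (name), e.g. an NM_ accession
    strand: str             # column 4 (strand), "+" or "-"
    exon_frames: List[int]  # column 16 (exonFrames), one value per exon


def parse_refgene_line(line: str) -> RefGeneRecord:
    """Extract columns 2, 4, and 16 from one tab-separated refGene.txt line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 16:
        raise ValueError(f"expected at least 16 columns, got {len(fields)}")
    # exonFrames is a comma-separated list that may end with a trailing comma.
    frames = [int(tok) for tok in fields[15].split(",") if tok.strip()]
    return RefGeneRecord(transcript=fields[1], strand=fields[3], exon_frames=frames)


# Toy line: only columns 2, 4, and 16 carry values from this description.
toy_fields = ["NA"] * 16
toy_fields[1] = "NM_001987"
toy_fields[3] = "+"
toy_fields[15] = "0,0,1,1,1,1,0,2,"
record = parse_refgene_line("\t".join(toy_fields) + "\n")
```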
Regarding a method of calculating codon triplet state information at breakpoint recloses, it includes, for example: in the case of negative strand transcription, the remaining codon states of the acquired exons are inversely ordered, the remaining codon states of the acquired exons being the 16 th column information of the reference genome information file of the human reference genome HG19 derived from the UCSC genome browser database; removing the codon state with the first of the remaining codon states of the exons via reverse ordering being 0; if the breakpoint is expressed as a direct gene structure form and the negative strand is transcribed and serves as a 3' end, acquiring the complementary state information of the previous codon at the reconnection of the breakpoint, and calculating the triple state information of the codon at the reconnection of the breakpoint; if the breakpoint is expressed as a direct gene structure and is negative strand transcribed and serves as the 5' end, the state information of the corresponding codon at the breakpoint reclosing is obtained and used for calculating the codon triplet state information at the breakpoint reclosing. In the case of positive strand transcription, the codon state of the first 0 of the remaining codon states of the obtained exons is removed; if the breakpoint is expressed as a direct gene structure form and is transcribed by a positive strand and is taken as a 3' end, acquiring the complementary state information of the previous codon at the reconnection of the breakpoint, and calculating the triple state information of the codon at the reconnection of the breakpoint; and if the breakpoint is expressed as a direct gene structural form and is positive chain transcribed and serves as a 5' end, acquiring the state information of the corresponding codon at the reconnection of the breakpoint, and calculating the codon triplet state information at the reconnection of the breakpoint. 
In some embodiments, after removing the first 0 codon state of the remaining codon states of the acquired exons, the method for predicting functional fusion for genetic structural rearrangement further comprises: if the breakpoint is expressed in a genome position form and is positive strand transcribed and serves as a 3' end, acquiring complementary state information of the breakpoint reconnection, and calculating codon triplet state information of the breakpoint reconnection; and if the breakpoint is expressed in the form of a genomic position and is positive strand transcribed and serves as a 5' end, acquiring the state information of the corresponding codon at the breakpoint reclosing, and calculating the codon triplet state information at the breakpoint reclosing.
The method of calculating the codon triplet state information at the breakpoint junction is described below with reference to the examples in Table 1.
Table 1
It should be understood that a breakpoint can be represented in two ways. In the first, the breakpoint is expressed in direct gene structure form, for example: ETV6 exon4 (5')-NTRK3 exon14 (3'). In the second, the breakpoint is expressed as a genomic (genome) position, for example: NTRK3 chr15:88578205 (annotated to NTRK3 intron 13).
For example, for the gene ETV6 in row 1 of Table 1, the transcript number in column 2 of the reference genome information file refGene.txt is NM_001987, column 4 of the refGene.txt file indicates positive-strand transcription ("+"), and the remaining codon states of the exons (exonFrames) in column 16 of the refGene.txt file are "0,0,1,1,1,1,0,2". In this case, the computing device 110 determines that ETV6 is positive-strand transcribed, and the exonFrames after removal of the first codon state equal to "0" are "0,1,1,1,1,0,2". In addition, the computing device 110 determines that the breakpoint is expressed in direct gene structure form and that ETV6 is positive-strand transcribed and serves as the 5' end. The computing device 110 therefore acquires the state information of the corresponding codon exonN (N=4) at the breakpoint junction: in the exonFrames "0,1,1,1,1,0,2" from which the first "0" has been removed, counting from left to right, the codon state information X at the 4th position is 1. That is, the state information of exonN (N=4) is 1.
For example, for the gene NTRK3 in row 2 of Table 1, the transcript number in column 2 of the reference genome information file refGene.txt is NM_002530, column 4 of the refGene.txt file indicates negative-strand transcription ("-"), and the remaining codon states of the exons (exonFrames) in column 16 of the refGene.txt file are "0,0,2,0,1,1,0,1,1,1,0,1,2,2,2,2,0,-1,-1,". In this case, the computing device 110 determines that NTRK3 is negative-strand transcribed and reverse-orders the acquired remaining codon states of the exons, which gives "-1,-1,0,2,2,2,2,1,0,1,1,1,0,1,1,0,2,0,0". The exonFrames after removal of the first codon state equal to "0" are then "-1,-1,2,2,2,2,1,0,1,1,1,0,1,1,0,2,0,0". In addition, the computing device 110 determines that the breakpoint is expressed in direct gene structure form and that NTRK3 is negative-strand transcribed and serves as the 3' end, so the computing device 110 acquires the complementary state information of the codon exonN-1 (N-1=13) preceding the corresponding codon exonN (N=14) at the breakpoint junction, for calculating the codon triplet state information at the breakpoint junction. That is, in the exonFrames "-1,-1,2,2,2,2,1,0,1,1,1,0,1,1,0,2,0,0", which have been reverse-ordered and from which the first "0" has been removed, counting from left to right, the state information at the 13th position is X=1 and its complementary state information is Y=2. In other words, the state information of exonN-1 (N-1=13) is 1, and the complementary state information is 2.
For another example, for the gene FGFR3 in row 3 of Table 1, the transcript number in column 2 of the reference genome information file refGene.txt is NM_000142, column 4 of the refGene.txt file indicates positive-strand transcription ("+"), and the remaining codon states of the exons (exonFrames) in column 16 of the refGene.txt file are "-1,0,1,1,1,0,1,0,1,0,2,1,1,0,0,2,2,0". In this case, i.e., FGFR3 NM_000142 exon17, the computing device 110 determines that FGFR3 is positive-strand transcribed, and the exonFrames after removal of the first "0" are "-1,1,1,1,0,1,0,1,0,2,1,1,0,0,2,2,0". In addition, the computing device 110 determines that the breakpoint is expressed in direct gene structure form and that FGFR3 is positive-strand transcribed and serves as the 5' end, so the computing device 110 acquires the state information of the corresponding codon exonN (N=17) at the breakpoint junction for calculating the codon triplet state information at the breakpoint junction. That is, in the exonFrames "-1,1,1,1,0,1,0,1,0,2,1,1,0,0,2,2,0" from which the first "0" has been removed, counting from left to right, the state information at the 17th position is 0 (X=0). That is, the state information of the corresponding codon exonN (N=17) is 0.
Based on a similar algorithm, for the gene TACC3 in row 4 of Table 1, i.e., TACC3 NM_006342 exon16, the remaining codon states of the exons (exonFrames) in column 16 of the refGene.txt file are "-1,0,0,2,2,0,1,0,2,0,0,2,1,0,2,0,". TACC3 is positive-strand transcribed, and the exonFrames after removal of the first codon state equal to "0" are "-1,0,2,2,0,1,0,2,0,0,2,1,0,2,0". The computing device 110 determines that the breakpoint is expressed in direct gene structure form and that TACC3 is positive-strand transcribed and serves as the 3' end, and acquires the complementary state information of the state information of the codon preceding the breakpoint junction, for calculating the codon triplet state information at the breakpoint junction. That is, in the exonFrames "-1,0,2,2,0,1,0,2,0,0,2,1,0,2,0" from which the first "0" has been removed, counting from left to right, the state information at the 15th position is X=0 and its complementary state information is Y=0.
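The four worked examples above can be reproduced with a short sketch. The function names `shifted_frames`, `complement`, and `junction_state`, and the `"5p"`/`"3p"` labels, are illustrative assumptions rather than names used by the embodiment; passing the UTR marker -1 through `complement` unchanged is likewise an assumption.

```python
from typing import List


def shifted_frames(exon_frames: List[int], strand: str) -> List[int]:
    """Reverse for negative-strand genes, then drop the first state equal to 0."""
    frames = list(reversed(exon_frames)) if strand == "-" else list(exon_frames)
    if 0 in frames:
        frames.remove(0)  # list.remove deletes only the first occurrence
    return frames


def complement(x: int) -> int:
    """Complementary state: 0 -> 0, 1 -> 2, 2 -> 1 (assumed: -1 stays -1)."""
    return x if x < 0 else (3 - x) % 3


def junction_state(exon_frames: List[int], strand: str, exon_n: int, end: str) -> int:
    """State used at the junction for a breakpoint given in exon form.

    5' partner: state X at position N (1-based) of the shifted list.
    3' partner: complement Y of the state at position N-1.
    """
    frames = shifted_frames(exon_frames, strand)
    if end == "5p":
        return frames[exon_n - 1]
    return complement(frames[exon_n - 2])


# exonFrames values quoted in the text above.
ETV6 = [0, 0, 1, 1, 1, 1, 0, 2]
NTRK3 = [0, 0, 2, 0, 1, 1, 0, 1, 1, 1, 0, 1, 2, 2, 2, 2, 0, -1, -1]
FGFR3 = [-1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1, 1, 0, 0, 2, 2, 0]
TACC3 = [-1, 0, 0, 2, 2, 0, 1, 0, 2, 0, 0, 2, 1, 0, 2, 0]

print(junction_state(ETV6, "+", 4, "5p"))    # X for ETV6 exon4
print(junction_state(NTRK3, "-", 14, "3p"))  # Y for NTRK3 exon14
print(junction_state(FGFR3, "+", 17, "5p"))  # X for FGFR3 exon17
print(junction_state(TACC3, "+", 16, "3p"))  # Y for TACC3 exon16
```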
If the breakpoint is expressed in genomic position (chr) form, the intron intervals of the gene need to be calculated from the cdsStart and cdsEnd of the file, and the intron N in which the breakpoint is located is confirmed. For example, the chr-type breakpoint KIF5B chr10:32316247-RET chr10:43611581 is annotated with the gene structure, the annotation result being, for example: KIF5B NM_004521 intron15-RET NM_020975 intron11. For KIF5B, the transcript number (name column) in column 2 and the transcription direction (strand column) in column 4 of the reference genome information file refGene.txt are NM_004521 and negative-strand transcription ("-"), respectively. The corresponding exonFrames column of the refGene.txt file is "0,0,1,0,0,1,0,1,0,0,2,1,0,0,0,0,0,1,0,2,2,0,0,0,1,-1". The computing device 110 determines that the breakpoint is expressed in genomic position form and that KIF5B is negative-strand transcribed and serves as the 5' end. The computing device 110 acquires the state information of the corresponding codon intron15 (N=15) at the breakpoint junction for calculating the codon triplet state information at the breakpoint junction. That is, in the exonFrames "0,1,0,0,1,0,1,0,0,2,1,0,0,0,0,0,1,0,2,2,0,0,0,1,-1" from which the first codon state equal to "0" has been removed, counting from left to right, the state information at the 15th position is X=0. In addition, the computing device 110 determines that the breakpoint is expressed in genomic position form and that NM_020975 of RET is positive-strand transcribed and serves as the 3' end, the corresponding exonFrames column of the refGene.txt file being "0,1,1,1,0,1,0,1,1,1,1,0,1,1,0,0,2,2,0,1". The computing device 110 acquires the complementary state information of the corresponding codon intron11 (N=11) at the breakpoint junction for calculating the codon triplet state information at the breakpoint junction.
That is, in the exonFrames "1,1,1,0,1,0,1,1,1,1,0,1,1,0,0,2,2,0,1" from which the first codon state equal to "0" has been removed, counting from left to right, the state information at the 11th position is X=0 and its complementary state is Y=0. It should be noted that X represents the state information of the corresponding codon, and Y represents the complementary state information of X. If X=0, the complementary state information is Y=0; if X=2, the complementary state information is Y=1; and if X=1, the complementary state information is Y=2.
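Confirming the intron N in which a genomic-position breakpoint lies can be sketched as follows. This is a minimal illustration under stated assumptions: exon intervals are taken as refGene-style 0-based start and exclusive end, the breakpoint is a 1-based position, and the function name `intron_number` and the coordinates in the usage lines are hypothetical.

```python
from typing import List, Optional, Tuple


def intron_number(exons: List[Tuple[int, int]], strand: str, pos: int) -> Optional[int]:
    """Return the transcript-order intron number containing pos, or None.

    exons: (start, end) pairs, 0-based start and exclusive end (assumed).
    pos:   1-based genomic position of the breakpoint (assumed).
    """
    ordered = sorted(exons)
    n = len(ordered)
    for i in range(n - 1):
        exon_end = ordered[i][1]        # last 1-based base of exon i
        next_start = ordered[i + 1][0]  # last 1-based base before next exon
        if exon_end < pos <= next_start:
            # Genomic gap i is intron i+1 on the positive strand; on the
            # negative strand introns are numbered from the other side.
            return i + 1 if strand == "+" else n - 1 - i
    return None  # the position is exonic or outside the transcript


# Hypothetical three-exon gene.
exons = [(100, 200), (400, 500), (700, 900)]
print(intron_number(exons, "+", 300))  # first genomic gap, positive strand
print(intron_number(exons, "-", 300))  # same gap, negative strand
```

Once N is known, the state at position N of the strand-ordered list with the first 0 removed gives X, and the 3' partner uses its complement Y, as in the KIF5B and RET examples.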
At step 206, the computing device 110 determines whether a gene involved in the gene structural rearrangement includes a kinase domain. If the computing device 110 determines that the gene involved in the gene structural rearrangement does not include a kinase domain, the process proceeds to step 212, where the functional fusion state is predicted for the gene structural rearrangement based on the codon triplet state information at the breakpoint junction. Kinases, also called protein kinases (PK for short), are enzymes that catalyze protein phosphorylation. Protein phosphorylation is the last link in the transmission of neural information within cells and leads to changes in the state of ion channel proteins and channel gates. More than about 400 protein kinases have been found, each of which has a homologous catalytic domain consisting of about 270 amino acid residues. The kinase domain is the part of the kinase protein sequence that performs protein phosphorylation. Thus, whether a gene involved in the gene structural rearrangement is a kinase gene can be determined by determining whether that gene includes a kinase domain.
At step 208, if the computing device 110 determines that the gene involved in the structural rearrangement of the gene includes a kinase domain, a genomic range of the corresponding kinase domain is obtained in order to determine inclusion state information about the kinase domain included in the new chimeric transcript formed by the structural rearrangement of the gene.
A method for determining the inclusion state information of the new chimeric transcript with respect to the kinase domain includes, for example: calculating the region in which the breakpoint of the gene structural rearrangement lies relative to the kinase domain, so as to determine the inclusion state information about whether the new chimeric transcript formed by the gene structural rearrangement includes the kinase domain, the inclusion state information being one of: the new chimeric transcript comprises the kinase domain; the new chimeric transcript partially comprises the kinase domain; or the new chimeric transcript does not comprise the kinase domain.
For example, the computing device 110 obtains the genomic range of the kinase domain of the corresponding protein tyrosine kinase (PTK) gene and then calculates the region in which the breakpoint of the gene structural rearrangement lies relative to the kinase domain, so as to determine the inclusion state information about whether the new chimeric transcript formed by the gene structural rearrangement includes the kinase domain. It is understood that the function of a protein is closely related to its structure, and that a conserved domain of a protein reflects the function of the protein to some extent. Assuming that the domain region of the RNA is at the 3' end, the gene at the 5' end is replaced, for example, with a long promoter sequence, so that the domain region is expressed more actively. For the kinase to be expressed more actively, its original kinase domain should not be frameshifted when the gene rearrangement structure is formed; a frameshift, if present, may disrupt the normal function of the original kinase. Thus, if the computing device 110 determines that a gene involved in the gene structural rearrangement includes a kinase domain, i.e., that the gene is a kinase gene, whether the gene rearrangement structure is a functional fusion can be determined from the inclusion state information of the new chimeric transcript with respect to the kinase domain.
A method of calculating the region in which the breakpoint of the gene structural rearrangement lies relative to the kinase domain includes, for example: the computing device 110 obtains the Pdot location regions corresponding to all domains from the NCBI protein conserved domain database; obtains the Pdot location region corresponding to the kinase domain based on the identifier indicating a kinase domain; converts the acquired Pdot location region corresponding to the kinase domain into a genomic location region; and calculates the breakpoint position of the gene structural rearrangement relative to the converted genomic location region, so as to determine the region in which the breakpoint lies relative to the kinase domain. This method is described in detail below with reference to FIG. 5 and is not repeated here.
Specifically, if the gene is positive-strand transcribed and serves as the 5' end and the breakpoint position is between the start and end positions of the kinase domain, it is determined that the new chimeric transcript partially comprises the kinase domain; if the gene is positive-strand transcribed and serves as the 5' end and the breakpoint position is less than the start position of the kinase domain, it is determined that the new chimeric transcript does not comprise the kinase domain; if the gene is positive-strand transcribed and serves as the 5' end and the breakpoint position is greater than the end position of the kinase domain, it is determined that the new chimeric transcript comprises the kinase domain.
If the gene is positive-strand transcribed and serves as the 3' end and the breakpoint position is between the start and end positions of the kinase domain, it is determined that the new chimeric transcript partially comprises the kinase domain; if the gene is positive-strand transcribed and serves as the 3' end and the breakpoint position is less than the start position of the kinase domain, it is determined that the new chimeric transcript comprises the kinase domain; if the gene is positive-strand transcribed and serves as the 3' end and the breakpoint position is greater than the end position of the kinase domain, it is determined that the new chimeric transcript does not comprise the kinase domain.
If the gene is negative-strand transcribed and serves as the 5' end and the breakpoint position is between the start and end positions of the kinase domain, it is determined that the new chimeric transcript partially comprises the kinase domain; if the gene is negative-strand transcribed and serves as the 5' end and the breakpoint position is less than the start position of the kinase domain, it is determined that the new chimeric transcript comprises the kinase domain; if the gene is negative-strand transcribed and serves as the 5' end and the breakpoint position is greater than the end position of the kinase domain, it is determined that the new chimeric transcript does not comprise the kinase domain.
If the gene is negative-strand transcribed and serves as the 3' end and the breakpoint position is between the start and end positions of the kinase domain, it is determined that the new chimeric transcript partially comprises the kinase domain; if the gene is negative-strand transcribed and serves as the 3' end and the breakpoint position is less than the start position of the kinase domain, it is determined that the new chimeric transcript does not comprise the kinase domain; if the gene is negative-strand transcribed and serves as the 3' end and the breakpoint position is greater than the end position of the kinase domain, it is determined that the new chimeric transcript comprises the kinase domain.
A method for determining the inclusion state information of the new chimeric transcript with respect to the kinase domain is exemplified below with reference to Table 2.
Table 2
Table 2 shows that the gene NTRK3 is negative-strand transcribed and serves as the 3' end. The breakpoint position is 88576276 (i.e., the end position end_gloc of transcript NM_002530). The kinase domain PTKc_TrkC has the start position start_gloc=88420190 and the end position end_gloc=88483976. Since the breakpoint position 88576276 is greater than the end position end_gloc=88483976 of the kinase domain, the computing device 110 determines that the new chimeric transcript comprises the kinase domain.
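The twelve strand, end, and position cases above reduce to one comparison, as the following sketch illustrates. The function name `kinase_domain_inclusion`, the `"5p"`/`"3p"` labels, the return strings, and the treatment of a breakpoint exactly on a domain boundary as "partial" are assumptions made for illustration.

```python
def kinase_domain_inclusion(strand: str, end: str, bp_pos: int,
                            domain_start: int, domain_end: int) -> str:
    """Classify the kinase domain as "full", "partial" or "none".

    strand: "+" or "-"; end: "5p" or "3p" (role of the gene in the fusion).
    Positions are genomic, with domain_start < domain_end.
    """
    if domain_start <= bp_pos <= domain_end:
        return "partial"  # boundary positions counted as inside (assumed)
    # A 3' partner on the positive strand and a 5' partner on the negative
    # strand both keep the sequence at coordinates above the breakpoint.
    keeps_higher = (strand == "+") == (end == "3p")
    domain_is_higher = bp_pos < domain_start
    return "full" if keeps_higher == domain_is_higher else "none"


# NTRK3 example with the values given for Table 2.
print(kinase_domain_inclusion("-", "3p", 88576276, 88420190, 88483976))
```

Expressing the rule as "which side of the breakpoint is retained" avoids twelve separate branches while giving the same result for each listed case.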
At step 210, the computing device 110 predicts the functional fusion state for the gene structural rearrangement based on the inclusion state information and/or the codon triplet state information at the breakpoint junction.
A method for predicting the functional fusion state for the gene structural rearrangement includes, for example: the computing device 110 determines whether a predetermined condition is satisfied, the predetermined condition including at least one of: the codon triplet state information at the breakpoint junction indicates non-frameshift; the inclusion state information indicates that the new chimeric transcript comprises the kinase domain; the inclusion state information indicates that the new chimeric transcript partially comprises the kinase domain; and if the computing device 110 determines that the predetermined condition is satisfied, it determines that the gene structural rearrangement is a functional fusion. The method for predicting the functional fusion state for the gene structural rearrangement is described in detail with reference to FIG. 3 and is not repeated here.
In the above scheme, the breakpoint of the gene structural rearrangement is calculated based on sequencing data; the edge remaining codons of the transcript corresponding to the calculated breakpoint are acquired so as to calculate the codon triplet state information at the breakpoint junction; it is determined whether a gene involved in the gene structural rearrangement includes a kinase domain; if it is determined that the gene involved in the gene structural rearrangement includes a kinase domain, the genomic range of the corresponding kinase domain is obtained so as to determine the inclusion state information about the kinase domain included in the new chimeric transcript formed by the gene structural rearrangement; and the functional fusion state is predicted for the gene structural rearrangement based on the inclusion state information and/or the codon triplet state information at the breakpoint junction. The invention can thereby accurately predict functional fusion for gene structural rearrangements detected by NGS.
A method for determining that a gene structural rearrangement is a functional fusion according to an embodiment of the present invention will be described below with reference to FIG. 3. FIG. 3 shows a flowchart of a method 300 for determining that a gene structural rearrangement is a functional fusion, according to an embodiment of the present invention. It should be appreciated that the method 300 may be performed, for example, at the electronic device 600 depicted in FIG. 6, and may also be performed at the computing device 110 depicted in FIG. 1. It should be appreciated that the method 300 may also include additional actions not shown and/or may omit actions shown; the scope of the invention is not limited in this respect.
At step 302, the computing device 110 determines whether a predetermined condition is satisfied, the predetermined condition including at least one of: the codon triplet state information at the breakpoint junction indicates non-frameshift; the inclusion state information indicates that the new chimeric transcript comprises the kinase domain; the inclusion state information indicates that the new chimeric transcript partially comprises the kinase domain.
Non-frameshift means that no frameshift mutation exists, i.e., the reading frame is not changed. A frameshift mutation is a mutation in which the insertion or deletion of 1, 2, or several base pairs at a site in the DNA sequence (a number that is not 3 or a multiple of 3, i.e., the added or removed bases do not correspond to one or more complete triplets) shifts the triplet reading frame that encodes amino acids. When the codon triplet state information at the breakpoint junction indicates non-frameshift, the gene structural rearrangement is translated into a normal amino acid sequence.
At step 304, if the computing device 110 determines that the predetermined condition is satisfied, it determines that the gene structural rearrangement is a functional fusion.
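Steps 302 and 304 amount to an "at least one of" test, which can be sketched as below. The function name `meets_predetermined_condition`, the use of `None` for information that was not computed, and the strings "full", "partial", and "none" are illustrative assumptions.

```python
from typing import Optional


def meets_predetermined_condition(in_frame: Optional[bool],
                                  inclusion: Optional[str]) -> bool:
    """True if at least one listed condition holds.

    in_frame:  True when the junction state indicates non-frameshift,
               None when it was not computed (assumed convention).
    inclusion: "full", "partial", "none", or None when no kinase domain
               is involved (assumed labels).
    """
    return bool(in_frame) or inclusion in ("full", "partial")


# A frameshifted junction that still retains the whole kinase domain
# satisfies the condition under this reading of "at least one of".
print(meets_predetermined_condition(False, "full"))
```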
For example, consider the gene structural rearrangement ETV6 NM_001987 exon1-4 (transcription 5' end)-NTRK3 NM_002530 exon14-Nf (transcription 3' end), where Nf represents the number of the last exon. One gene in the gene structural rearrangement, NTRK3, serves as the 3' end, so the complementary state of the codon state of exonN-1 (N=14, N-1=13) is taken. The codon state of exonN-1 (N-1=13) is 1, and the complementary state of that codon state is 2. The other gene in the gene structural rearrangement, ETV6, serves as the 5' end, so the state corresponding to the codon of exonN (N=4) is taken. The state corresponding to the codon of exonN (N=4) is 1. The reading frame state at the breakpoint junction is therefore X_ETV6 + Y_NTRK3 = 1 + 2 = 3, where X represents the corresponding state of the codon and Y represents the complementary state of that corresponding state. It should be understood that if the corresponding state of the codon is 0, the complementary state is 0; if the corresponding state is 2, the complementary state is 1; and if the corresponding state is 1, the complementary state is 2. The triplet codon state at the breakpoint junction thus indicates non-frameshift (in-frame), i.e., the predetermined condition is satisfied.
For another example, consider the gene structural rearrangement ETV6 chr12:12,011,516 (transcription 5' end)-NTRK3 chr15:88,597,803 (transcription 3' end), with the breakpoint in NTRK3. The computing device 110 determines that NTRK3 is negative-strand transcribed and serves as the 3' end. The breakpoint position is greater than the end position of the kinase domain, and thus the computing device 110 determines that the inclusion state information indicates that the new chimeric transcript comprises the kinase domain, i.e., the predetermined condition is satisfied. The computing device 110 therefore determines that the gene structural rearrangement ETV6 chr12:12,011,516 (transcription 5' end)-NTRK3 chr15:88,597,803 (transcription 3' end) is a functional fusion. By employing the means described above, the present disclosure can accurately determine whether a previously unknown gene structural rearrangement is a functional fusion.
A method for calculating the codon triplet state information at the breakpoint junction according to an embodiment of the present invention will be described below with reference to FIG. 4. FIG. 4 shows a flowchart of a method 400 for calculating the codon triplet state information at the breakpoint junction, according to an embodiment of the invention. It should be appreciated that the method 400 may be performed, for example, at the electronic device 600 depicted in FIG. 6, and may also be performed at the computing device 110 depicted in FIG. 1. It should be appreciated that the method 400 may also include additional actions not shown and/or may omit actions shown; the scope of the invention is not limited in this respect.
At step 402, computing device 110 obtains gene fusion information in a plurality of predetermined databases to determine a predetermined set of gene fusions.
A method of determining the predetermined gene fusion set includes, for example: the computing device 110 obtains the gene fusion information from a plurality of predetermined databases and determines whether the number of times a given item of gene fusion information occurs exceeds a predetermined threshold; if so, the predetermined gene fusion set is determined based on the gene fusion information whose number of occurrences exceeds the predetermined threshold. The predetermined gene fusion set includes a plurality of gene structural rearrangements identified as functional fusions.
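Building the predetermined gene fusion set by occurrence count can be sketched as follows. The function name `build_known_fusion_set`, the choice to count each fusion at most once per database, and the placeholder fusion names are assumptions; the text above does not specify how occurrences are counted or which databases are used.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_known_fusion_set(databases: Iterable[List[str]], threshold: int) -> Set[str]:
    """Keep fusions whose occurrence count exceeds the threshold.

    Each database contributes at most one count per fusion (assumed).
    """
    counts: Counter = Counter()
    for entries in databases:
        counts.update(set(entries))  # de-duplicate within one database
    return {fusion for fusion, n in counts.items() if n > threshold}


# Placeholder database contents, for illustration only.
databases = [
    ["GENE_A-GENE_B", "GENE_C-GENE_D"],
    ["GENE_A-GENE_B"],
    ["GENE_A-GENE_B", "GENE_C-GENE_D", "GENE_E-GENE_F"],
]
known = build_known_fusion_set(databases, threshold=1)
print(sorted(known))
```

A rearrangement found in `known` would be treated as a functional fusion directly, and only the remaining rearrangements would go through the codon state calculation.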
At step 404, the computing device 110 determines whether the calculated gene structural rearrangement belongs to a predetermined set of gene fusions.
At step 406, if the computing device 110 determines that the genes involved in the rearrangement of the gene structure belong to a predetermined set of gene fusions, the rearrangement of the gene structure is determined to be a functional fusion.
At step 408, if the computing device 110 determines that the gene involved in the gene structural rearrangement does not belong to the predetermined set of gene fusions, the edge remaining codons of the transcription corresponding to the breakpoint of the gene structural rearrangement are obtained to calculate codon triplet state information at the breakpoint reclosing.
For example, if the computing device 110 determines that the genes involved in the gene structural rearrangement "FGFR3 NM_000142 exon1-17 (predicted transcription 5' end)-TACC3 NM_006342 exon16" do not belong to the predetermined gene fusion set, the computing device 110 determines, based on the remaining codon state of each exon indicated by the information in column 19 of the acquired BED file, the remaining codon state of the gene FGFR3 at the breakpoint junction and the remaining codon state of the gene TACC3 at the breakpoint junction in the gene rearrangement structure, respectively, so as to calculate whether the codon triplet state information at the breakpoint junction of the gene rearrangement structure indicates the presence of a frameshift. Specifically, one gene in the gene structural rearrangement, FGFR3, is positive-strand transcribed and serves as the 5' end, so the corresponding state of the codon of exonN (N=17) is determined; this state is 0. The other gene in the gene structural rearrangement, TACC3, is positive-strand transcribed and serves as the 3' end, so the complementary state of the corresponding state of the codon of exonN-1 (N=16, N-1=15) is determined; the corresponding state of the codon of exonN-1 (N-1=15) is 0, and its complementary state is 0. The reading frame state at the breakpoint junction is X_FGFR3 + Y_TACC3 = 0 + 0 = 0, which forms a normal triplet codon state. That is, the triplet codon state at the breakpoint junction indicates non-frameshift (in-frame), and the predetermined condition is satisfied. The computing device 110 therefore determines that the gene structural rearrangement FGFR3 NM_000142 exon1-17 (predicted transcription 5' end)-TACC3 NM_006342 exon16 is a functional fusion.
It should be appreciated that, when calculating the codon triplet state information (frame) at the breakpoint junction of the gene rearrangement structure, the gene structural rearrangement can be determined to be a functional fusion as long as one of the two states is "-1" (indicating that the position lies in a UTR) and the other is not "-1"; if both states are "-1", the triplet codon state at the breakpoint junction is determined to indicate a frameshift (out-of-frame), and it can thus be determined that the gene structural rearrangement is not a functional fusion.
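The frame decision at the breakpoint junction, including the "-1" (UTR) rule, can be summarized in a short sketch. The function name `junction_supports_fusion` is hypothetical, and reading "a normal triplet codon state" as a sum of 0 or 3, i.e., divisible by 3, is an interpretation of the two worked sums given in the text (1 + 2 = 3 and 0 + 0 = 0).

```python
def junction_supports_fusion(x_5p: int, y_3p: int) -> bool:
    """Decide whether the junction states indicate a functional fusion.

    x_5p: state X of the 5' partner (0, 1, 2, or -1 for UTR).
    y_3p: complementary state Y of the 3' partner (0, 1, 2, or -1 for UTR).
    """
    if x_5p == -1 and y_3p == -1:
        return False  # both in UTR: treated as frameshift (out-of-frame)
    if x_5p == -1 or y_3p == -1:
        return True   # exactly one in UTR: treated as functional fusion
    return (x_5p + y_3p) % 3 == 0  # sums of 0 or 3 form a complete triplet


print(junction_supports_fusion(1, 2))  # ETV6 (X=1) with NTRK3 (Y=2)
print(junction_supports_fusion(0, 0))  # FGFR3 (X=0) with TACC3 (Y=0)
```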
By employing the above means, the present invention can accurately calculate the codon triplet state information at the breakpoint junction.
A method for calculating the region in which the breakpoint of a gene structural rearrangement lies relative to the kinase domain according to an embodiment of the present invention will be described below with reference to FIG. 5. FIG. 5 shows a flowchart of a method 500 for calculating the region in which the breakpoint of a gene structural rearrangement lies relative to the kinase domain, according to an embodiment of the present invention. It should be appreciated that the method 500 may be performed, for example, at the electronic device 600 depicted in FIG. 6, and may also be performed at the computing device 110 depicted in FIG. 1. It should be appreciated that the method 500 may also include additional actions not shown and/or may omit actions shown; the scope of the invention is not limited in this respect.
At step 502, the computing device 110 obtains the Pdot location regions corresponding to all domains from the NCBI protein conserved domain database. For example, the computing device 110 accesses the NCBI protein Conserved Domain Database (CDD), which collects sequence information for a large number of conserved domains and proteins, to obtain the Pdot location regions corresponding to all domains.
At step 504, the computing device 110 obtains the Pdot location region corresponding to the kinase domain based on the identifier indicating a kinase domain. A kinase domain typically has a corresponding identifier; for example, PTKc_TrkC corresponds to the identifier kinaseDomain.
At step 506, the computing device 110 converts the acquired Pdot location region corresponding to the kinase domain into a genomic location region (i.e., a Gdot location region). The conversion is performed because the obtained breakpoint position information is genomic position information; once the kinase domain is expressed as a position on the genome, it can be compared directly with the breakpoint position.
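One possible way to convert a Pdot region into a Gdot region is sketched below. It assumes that the coding portion of each exon is available as a 1-based inclusive genomic interval and that amino acid p occupies coding nucleotides 3(p-1)+1 through 3p; the function names and the coordinates in the usage lines are hypothetical, and the embodiment may perform the conversion differently.

```python
from typing import List, Tuple


def cds_offset_to_genome(cds_blocks: List[Tuple[int, int]], strand: str, offset: int) -> int:
    """Map a 1-based coding-sequence offset to a genomic position.

    cds_blocks: coding parts of the exons as 1-based inclusive (start, end).
    """
    blocks = sorted(cds_blocks)
    if strand == "-":
        blocks = blocks[::-1]  # transcript order runs from high to low
    remaining = offset
    for start, end in blocks:
        length = end - start + 1
        if remaining <= length:
            return start + remaining - 1 if strand == "+" else end - remaining + 1
        remaining -= length
    raise ValueError("offset lies beyond the coding sequence")


def pdot_to_gdot(cds_blocks: List[Tuple[int, int]], strand: str,
                 aa_start: int, aa_end: int) -> Tuple[int, int]:
    """Return (start_gloc, end_gloc) with start_gloc <= end_gloc."""
    g_first = cds_offset_to_genome(cds_blocks, strand, 3 * (aa_start - 1) + 1)
    g_last = cds_offset_to_genome(cds_blocks, strand, 3 * aa_end)
    return (min(g_first, g_last), max(g_first, g_last))


# Hypothetical gene with two coding blocks of 30 nucleotides each.
blocks = [(101, 130), (201, 230)]
print(pdot_to_gdot(blocks, "+", 5, 15))
print(pdot_to_gdot(blocks, "-", 5, 15))
```

Returning the pair in ascending genomic order matches the start_gloc and end_gloc convention used for the kinase domain, so the breakpoint comparison does not depend on strand at this stage.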
At step 508, the computing device 110 calculates the breakpoint position of the gene structural rearrangement relative to the converted genomic location region, so as to determine the region in which the breakpoint lies relative to the kinase domain. For example, the computing device 110 can calculate in which relative region of the kinase domain the breakpoint position lies (e.g., calculate the relative percentage of the breakpoint position within the kinase domain), thereby determining that the inclusion state information of the new chimeric transcript with respect to the kinase domain is one of: comprises the kinase domain, partially comprises the kinase domain, or does not comprise the kinase domain.
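The relative percentage mentioned above can be illustrated with a minimal sketch. Measuring the percentage along the direction of transcription and letting values below 0 or above 100 signal a breakpoint outside the domain are assumptions; the function name `breakpoint_relative_percent` is hypothetical.

```python
def breakpoint_relative_percent(bp_pos: int, domain_start: int,
                                domain_end: int, strand: str) -> float:
    """Position of the breakpoint within the kinase domain, in percent.

    0 is the 5' edge and 100 the 3' edge of the domain in transcript
    direction (assumed). Values outside 0..100 mean the breakpoint lies
    outside the domain.
    """
    span = domain_end - domain_start
    if strand == "+":
        return (bp_pos - domain_start) / span * 100
    return (domain_end - bp_pos) / span * 100


print(breakpoint_relative_percent(150, 100, 200, "+"))
print(breakpoint_relative_percent(125, 100, 200, "-"))
```

For the NTRK3 values given for Table 2 the result is negative, i.e., the breakpoint lies 5' of the domain in transcript direction, which agrees with a 3' partner that retains the whole kinase domain.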
FIG. 6 schematically shows a block diagram of an electronic device 600 suitable for implementing embodiments of the invention. The electronic device 600 may be used to implement the methods 200 to 500 shown in FIGS. 2 to 5. As shown in FIG. 6, the electronic device 600 includes a central processing unit (i.e., CPU 601) that can perform various suitable actions and processes according to computer program instructions stored in a read-only memory (i.e., ROM 602) or computer program instructions loaded from a storage unit 608 into a random access memory (i.e., RAM 603). The RAM 603 can also store various programs and data required for the operation of the electronic device 600. The CPU 601, ROM 602, and RAM 603 are connected to each other through a bus 604. An input/output interface (i.e., I/O interface 605) is also connected to the bus 604.
A number of components in the electronic device 600 are connected to the I/O interface 605, including an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The CPU 601 performs the respective methods and processes described above, for example, the methods 200 to 500. For example, in some embodiments, the methods 200 to 500 may be implemented as a computer software program stored on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the CPU 601, one or more of the operations of the methods 200 to 500 described above may be performed. Alternatively, in other embodiments, the CPU 601 may be configured to perform one or more actions of the methods 200 to 500 in any other suitable manner (e.g., by means of firmware).
It should be further appreciated that the present invention can be a method, apparatus, system, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include the following: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object oriented programming languages such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized using state information of the computer readable program instructions, and the electronic circuitry can then execute the computer readable program instructions to implement aspects of the present invention.
Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor in a voice interaction device, a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of devices, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the technical improvements in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
The above is only an alternative embodiment of the present invention and is not intended to limit the present invention, and various modifications and variations will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for predicting functional fusion for a gene structural rearrangement, comprising:
calculating a breakpoint of the gene structural rearrangement based on sequencing data;
obtaining the remaining codons at the transcript edge corresponding to the breakpoint of the gene structural rearrangement, so as to calculate codon triplet state information at the breakpoint reconnection;
determining whether a gene involved in the gene structural rearrangement comprises a kinase domain;
in response to determining that the gene involved in the gene structural rearrangement comprises a kinase domain, obtaining a genomic range corresponding to the kinase domain, so as to determine inclusion state information about whether the new chimeric transcript formed by the gene structural rearrangement includes the kinase domain;
predicting a functional fusion state for the gene structural rearrangement based on the inclusion state information and/or the codon triplet state information at the breakpoint reconnection,
wherein calculating the codon triplet state information at the breakpoint reconnection comprises:
in the case of negative strand transcription, reversing the order of the obtained remaining codon states of the exons, the remaining codon states of the exons being the information in column 16 of the reference genome information file of the human reference genome HG19 obtained from the UCSC Genome Browser database;
removing the first codon state of 0 among the reverse-ordered remaining codon states of the exons;
if the breakpoint is expressed in direct gene structure form, is negative strand transcribed, and serves as the 3' end, obtaining the complementary state information of the previous codon at the breakpoint reconnection for calculating the codon triplet state information at the breakpoint reconnection; and
if the breakpoint is expressed in direct gene structure form, is negative strand transcribed, and serves as the 5' end, obtaining the state information of the corresponding codon at the breakpoint reconnection for calculating the codon triplet state information at the breakpoint reconnection.
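The strand handling recited in claim 1 can be illustrated with a minimal sketch. It assumes that column 16 of the UCSC refGene-style annotation file is the comma-separated `exonFrames` field, listed in genomic order with a trailing comma. The function name and the example value are hypothetical, and the claim's removal of a 0 state and its complementary-state step are deliberately not reproduced.

```python
def exon_frames_in_transcription_order(exon_frames_field: str, strand: str) -> list:
    """Parse a comma-separated exonFrames string and order it 5'->3' along the transcript."""
    # UCSC-style list fields end with a trailing comma, so empty pieces are skipped.
    frames = [int(x) for x in exon_frames_field.split(",") if x.strip()]
    if strand == "-":
        # Exons are stored in genomic order; a negative-strand transcript reads them in reverse.
        frames.reverse()
    elif strand != "+":
        raise ValueError(f"unexpected strand: {strand!r}")
    return frames


# Hypothetical field value, not taken from a real transcript.
print(exon_frames_in_transcription_order("-1,2,0,1,0,", "-"))  # [0, 1, 0, 2, -1]
```

A value of -1 conventionally marks an exon with no coding sequence, so downstream frame logic would normally skip such entries.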
2. The method of claim 1, wherein predicting a functional fusion state for the gene structural rearrangement comprises:
determining whether a predetermined condition is satisfied, the predetermined condition including at least one of:
the codon triplet state information at the breakpoint reconnection indicates no frameshift;
the inclusion state information indicates that the new chimeric transcript comprises a kinase domain; and
the inclusion state information indicates that the new chimeric transcript partially comprises a kinase domain;
if it is determined that the predetermined condition is satisfied, determining that the gene structural rearrangement is a functional fusion.
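The decision in claim 2 can be sketched as a small predicate. The claim does not state how the listed conditions combine, so the sketch exposes both readings through a hypothetical `require_all` flag; the enum and function names are likewise assumptions made for illustration.

```python
from enum import Enum


class KinaseInclusion(Enum):
    FULL = "full"        # transcript comprises the kinase domain
    PARTIAL = "partial"  # transcript partially comprises the kinase domain
    NONE = "none"        # transcript does not comprise the kinase domain


def is_functional_fusion(in_frame: bool, inclusion: KinaseInclusion,
                         require_all: bool = False) -> bool:
    """Evaluate the predetermined condition under an assumed combination rule."""
    kinase_retained = inclusion in (KinaseInclusion.FULL, KinaseInclusion.PARTIAL)
    checks = [in_frame, kinase_retained]
    # Assumption: "at least one of" is read as a logical OR unless require_all is set.
    return all(checks) if require_all else any(checks)


print(is_functional_fusion(True, KinaseInclusion.NONE))                    # True
print(is_functional_fusion(True, KinaseInclusion.NONE, require_all=True))  # False
```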
3. The method of claim 1, wherein determining inclusion state information about whether the new chimeric transcript formed by the gene structural rearrangement includes the kinase domain comprises:
calculating the region of the breakpoint of the gene structural rearrangement relative to the kinase domain, so as to determine the inclusion state information, the inclusion state information comprising one of:
the new chimeric transcript comprises the kinase domain;
the new chimeric transcript partially comprises the kinase domain; or
the new chimeric transcript does not comprise the kinase domain.
4. The method of claim 3, wherein calculating the region of the breakpoint of the gene structural rearrangement relative to the kinase domain comprises:
obtaining the Pdot position regions corresponding to all domains from the NCBI protein conserved domain database;
obtaining the Pdot position region corresponding to the kinase domain based on an identifier indicating the kinase domain;
converting the obtained Pdot position region corresponding to the kinase domain into a genomic position region; and
calculating the breakpoint position of the gene structural rearrangement relative to the converted genomic position region, so as to determine the region of the breakpoint of the gene structural rearrangement relative to the kinase domain.
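The conversion step in claim 4 can be sketched as follows, reading "Pdot position" as a protein-level (p.) residue range, which is an assumption. The sketch further assumes that the coding portions of the exons are available as 0-based, half-open genomic intervals in genomic order; the function names and coordinates are hypothetical.

```python
def cds_offset_to_genomic(offset, cds_exons, strand):
    """Map a 0-based offset within the coding sequence to a 0-based genomic position."""
    # Walk coding intervals in transcription order: reversed for the negative strand.
    ordered = cds_exons if strand == "+" else list(reversed(cds_exons))
    for start, end in ordered:
        length = end - start
        if offset < length:
            return start + offset if strand == "+" else end - 1 - offset
        offset -= length
    raise ValueError("offset lies beyond the coding sequence")


def protein_range_to_genomic(aa_start, aa_end, cds_exons, strand):
    """Convert a 1-based residue range to an inclusive (low, high) genomic span."""
    first = cds_offset_to_genomic(3 * (aa_start - 1), cds_exons, strand)  # first base of first codon
    last = cds_offset_to_genomic(3 * aa_end - 1, cds_exons, strand)       # last base of last codon
    return (min(first, last), max(first, last))


# Hypothetical coding intervals: 6 + 9 bases, i.e. five codons.
exons = [(100, 106), (200, 209)]
print(protein_range_to_genomic(2, 3, exons, "+"))  # (103, 202)
print(protein_range_to_genomic(2, 3, exons, "-"))  # (200, 205)
```

The returned span may cross introns, as in the positive-strand example, which is usually acceptable when the span is only compared against a breakpoint position.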
5. The method of claim 1, wherein calculating the codon triplet state information at the breakpoint reconnection comprises:
in the case of positive strand transcription, removing the first codon state of 0 among the obtained remaining codon states of the exons;
if the breakpoint is expressed in direct gene structure form, is positive strand transcribed, and serves as the 3' end, obtaining the complementary state information of the previous codon at the breakpoint reconnection for calculating the codon triplet state information at the breakpoint reconnection; and
if the breakpoint is expressed in direct gene structure form, is positive strand transcribed, and serves as the 5' end, obtaining the state information of the corresponding codon at the breakpoint reconnection for calculating the codon triplet state information at the breakpoint reconnection.
6. The method of claim 5, wherein, after removing the first codon state of 0 among the obtained remaining codon states of the exons, the method further comprises:
if the breakpoint is expressed in genomic position form, is positive strand transcribed, and serves as the 3' end, obtaining the complementary state information at the breakpoint reconnection for calculating the codon triplet state information at the breakpoint reconnection; and
if the breakpoint is expressed in genomic position form, is positive strand transcribed, and serves as the 5' end, obtaining the state information of the corresponding codon at the breakpoint reconnection for calculating the codon triplet state information at the breakpoint reconnection.
7. The method of claim 3, wherein determining inclusion state information about whether the new chimeric transcript formed by the gene structural rearrangement includes the kinase domain comprises:
determining that the new chimeric transcript partially comprises the kinase domain if the breakpoint position of the gene structural rearrangement is between the start position and the end position of the kinase domain;
determining that the new chimeric transcript does not comprise the kinase domain if any of the following is satisfied:
the positive strand is transcribed and serves as the 5' end, and the breakpoint position is less than the start position of the kinase domain; or
the positive strand is transcribed and serves as the 3' end, and the breakpoint position is greater than the end position of the kinase domain;
determining that the new chimeric transcript comprises the kinase domain if any of the following is satisfied:
the positive strand is transcribed and serves as the 5' end, and the breakpoint position is greater than the end position of the kinase domain; or
the positive strand is transcribed and serves as the 3' end, and the breakpoint position is less than the start position of the kinase domain.
8. The method of claim 7, wherein determining inclusion state information about whether the new chimeric transcript formed by the gene structural rearrangement includes the kinase domain comprises:
determining that the new chimeric transcript does not comprise the kinase domain if any of the following is satisfied:
the negative strand is transcribed and serves as the 5' end, and the breakpoint position is greater than the end position of the kinase domain; or
the negative strand is transcribed and serves as the 3' end, and the breakpoint position is less than the start position of the kinase domain;
determining that the new chimeric transcript comprises the kinase domain if any of the following is satisfied:
the negative strand is transcribed and serves as the 5' end, and the breakpoint position is less than the start position of the kinase domain; or
the negative strand is transcribed and serves as the 3' end, and the breakpoint position is greater than the end position of the kinase domain.
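The strand and end rules of claims 7 and 8 reduce to asking which side of the breakpoint is retained in the chimeric transcript. The sketch below encodes those eight cases; the function name, the string labels, and the inclusive treatment of "between" are assumptions made for illustration.

```python
def kinase_inclusion(strand: str, end: str, breakpoint: int,
                     domain_start: int, domain_end: int) -> str:
    """Classify kinase-domain retention as 'full', 'partial', or 'none'.

    strand is '+' or '-'; end is '5p' or '3p' (role of the gene in the fusion);
    positions are genomic coordinates with domain_start <= domain_end.
    """
    if strand not in ("+", "-") or end not in ("5p", "3p"):
        raise ValueError("strand must be '+'/'-' and end must be '5p'/'3p'")
    # Assumption: a breakpoint on a domain boundary counts as inside the domain.
    if domain_start <= breakpoint <= domain_end:
        return "partial"
    # A positive-strand 5' partner and a negative-strand 3' partner both retain
    # the sequence at coordinates below the breakpoint; the other two cases
    # retain the sequence above it.
    keeps_lower = (strand == "+") == (end == "5p")
    if keeps_lower:
        return "full" if breakpoint > domain_end else "none"
    return "full" if breakpoint < domain_start else "none"


# Hypothetical coordinates: kinase domain spanning positions 100..200.
print(kinase_inclusion("+", "5p", 50, 100, 200))   # none
print(kinase_inclusion("-", "5p", 50, 100, 200))   # full
print(kinase_inclusion("+", "3p", 150, 100, 200))  # partial
```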
9. The method of claim 1, further comprising:
acquiring gene fusion information from a plurality of predetermined databases so as to determine a predetermined gene fusion set;
determining whether the calculated gene structural rearrangement belongs to the predetermined gene fusion set; and
in response to determining that the gene involved in the gene structural rearrangement belongs to the predetermined gene fusion set, determining that the gene structural rearrangement is a functional fusion.
10. The method of claim 9, wherein calculating the codon triplet state information at the breakpoint reconnection comprises:
in response to determining that the gene involved in the gene structural rearrangement does not belong to the predetermined gene fusion set, obtaining the remaining codons at the transcript edge corresponding to the breakpoint of the gene structural rearrangement, so as to calculate the codon triplet state information at the breakpoint reconnection.
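The lookup-then-fallback flow of claims 9 and 10 can be sketched as below. The claims do not say whether the predetermined set is keyed by single genes or by gene pairs; ordered (5' gene, 3' gene) pairs are assumed here. The function names are hypothetical, and the entries are illustrative rather than the contents of any particular database.

```python
def build_known_fusion_set(sources):
    """Merge (5' gene, 3' gene) pairs from several sources into one normalized set."""
    known = set()
    for pairs in sources:
        for gene5, gene3 in pairs:
            # Normalize case so the same pair from different sources collapses.
            known.add((gene5.upper(), gene3.upper()))
    return known


def predict_functional(gene5, gene3, known, fallback):
    """Return True for a known fusion; otherwise defer to the frame/domain analysis."""
    if (gene5.upper(), gene3.upper()) in known:
        return True
    # Only rearrangements outside the set reach the codon-state computation.
    return fallback(gene5, gene3)


known = build_known_fusion_set([[("EML4", "ALK")], [("bcr", "abl1"), ("EML4", "ALK")]])
print(sorted(known))  # [('BCR', 'ABL1'), ('EML4', 'ALK')]
print(predict_functional("eml4", "ALK", known, lambda a, b: False))  # True
```

Passing the fallback as a callable keeps the set lookup independent of how the codon-state and kinase-domain checks are implemented.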
11. A computing device, comprising:
at least one processing unit; and
at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the computing device to perform the steps of the method according to any one of claims 1 to 10.
12. A computer readable storage medium, characterized in that a computer program is stored on the computer readable storage medium, which computer program, when executed by a machine, implements the method according to any of claims 1 to 10.
CN202310136782.0A 2023-02-17 2023-02-17 Method, apparatus and medium for predicting functional fusion for gene structural rearrangement Active CN116312797B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310136782.0A CN116312797B (en) 2023-02-17 2023-02-17 Method, apparatus and medium for predicting functional fusion for gene structural rearrangement

Publications (2)

Publication Number Publication Date
CN116312797A CN116312797A (en) 2023-06-23
CN116312797B CN116312797B (en) 2024-02-20

Family

ID=86819685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310136782.0A Active CN116312797B (en) 2023-02-17 2023-02-17 Method, apparatus and medium for predicting functional fusion for gene structural rearrangement

Country Status (1)

Country Link
CN (1) CN116312797B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102177235A (en) * 2008-09-08 2011-09-07 赛莱克蒂斯公司 Meganuclease variants cleaving a DNA target sequence from a glutamine synthetase gene and uses thereof
CN108229100A (en) * 2018-05-22 2018-06-29 至本医疗科技(上海)有限公司 DNA resets region and corresponding RNA product detections method, equipment and storage medium
CN111292809A (en) * 2020-01-20 2020-06-16 至本医疗科技(上海)有限公司 Method, electronic device, and computer storage medium for detecting RNA level gene fusion
CN112599188A (en) * 2021-03-01 2021-04-02 上海思路迪医学检验所有限公司 DNA fusion breakpoint annotation method for single-end anchoring of fusion driving gene
CN114822700A (en) * 2022-04-25 2022-07-29 至本医疗科技(上海)有限公司 Methods, devices and media for presenting rearranged or fused structural subtypes
EP4114979A2 (en) * 2020-03-04 2023-01-11 Foundation Medicine, Inc. Bcor rearrangements and uses thereof

Also Published As

Publication number Publication date
CN116312797A (en) 2023-06-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant