WO2023184065A1 - Fusion gene identification method, apparatus, device, program and storage medium - Google Patents

Fusion gene identification method, apparatus, device, program and storage medium

Info

Publication number
WO2023184065A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
fragment
sequence
fusion gene
fragments
Prior art date
Application number
PCT/CN2022/083275
Other languages
English (en)
French (fr)
Inventor
刘梦佳
Original Assignee
京东方科技集团股份有限公司
成都京东方光电科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司, 成都京东方光电科技有限公司 filed Critical 京东方科技集团股份有限公司
Priority to PCT/CN2022/083275 priority Critical patent/WO2023184065A1/zh
Priority to CN202280000556.3A priority patent/CN117136411A/zh
Publication of WO2023184065A1 publication Critical patent/WO2023184065A1/zh

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search

Definitions

  • The present disclosure belongs to the field of gene detection technology, and particularly relates to a fusion gene identification method, apparatus, device, program and storage medium.
  • Gene fusion is a process in which chromosomal translocation, deletion or inversion causes all or part of the sequences of two unrelated genes to fuse with each other into a new gene.
  • Tens of thousands of gene fusions have been discovered.
  • Many gene fusions have been reported to be closely related to the occurrence of cancer. Among them, common fusion genes such as ALK (anaplastic lymphoma kinase), ROS1 (ROS proto-oncogene 1, receptor tyrosine kinase) and NTRK (neurotrophin receptor kinase, i.e. neurotrophic factor receptor tyrosine kinase) are used as diagnostic tools for some cancers and other diseases.
  • the present disclosure provides a method, device, equipment, program and storage medium for identifying fusion genes.
  • screening out target fusion gene pairs from the sequencing fragments based on the distribution and target capture results includes:
  • Target fusion gene pairs are screened out from the sequencing fragments based on the breakpoint position and the position spanning the sequencing fragments.
  • selecting target fusion gene pairs from the sequencing fragments based on the breakpoint position and the position spanning the sequencing fragments includes:
  • the filtered and retained sequencing fragments are used as candidate fusion gene pairs;
  • filtering low-quality fusion gene pairs among the candidate fusion gene pairs to obtain target fusion gene pairs includes:
  • calculating the fusion gene score of the second candidate fusion gene pair based on the distance between the breakpoint positions of the second candidate fusion gene pair and the average sequencing depth of the targeted region includes:
  • the ratio of the distance between the two ends of the spanning sequencing fragment in the second candidate fusion gene pair and the breakpoint position to the length of the sequencing fragment is used as the second factor score;
  • the ratio of the distance between the two ends of the hit sequencing fragment in the second candidate fusion gene pair and the breakpoint position to the doubling degree of the sequencing fragment is used as the third factor score, where the doubling degree is the product of the length of the sequencing fragment and the doubling parameter;
  • the ratio of the sum of the first factor score, the second factor score and the third factor score to the average sequencing depth of the targeted region is used as the fusion gene score of the second candidate fusion gene pair.
  • comparing the targeted gene sequencing sequence with the reference gene sequence to obtain the distribution of spanning sequencing fragments and hit sequencing fragments located in the targeted region in the targeted sequencing sequence and the target capture results includes:
  • screening out the spanning sequencing fragments in the targeted sequencing results from the comparison results based on the spanning fragment screening conditions includes:
  • the sum value obtained by summing the length of the left-end sequencing fragment, the length of the right-end sequencing fragment, and the distance between the left and right end sequencing fragments of the sequencing fragment is greater than the difference between the quartile of the length of the sequencing fragment and the target parameter.
  • the target parameter is a parameter that controls the number and stringency of output pairings;
  • the multiple alignment values of the left-end sequencing fragment and the right-end sequencing fragment in the sequencing fragments include the proper alignment feature value and the secondary alignment feature value, and do not include the unpaired feature value of the sequencing fragment or the unpaired feature value of the other sequencing fragment.
  • screening out hit sequencing fragments and strongly supporting hit sequencing fragments of the targeted sequencing results from the comparison results based on the hit fragment screening conditions include:
  • the alignment length of each position in the sequencing fragment is greater than a length threshold, and the alignment length is greater than one-third of the total length of the sequencing fragment;
  • the sequence of the alignment length within the sequencing fragment has no similar sequence in the alignment result;
  • the quality value of the number of comparisons of the sequencing fragments is greater than or equal to the number of comparisons
  • calculating the breakpoint position of the sequencing fragment based on the number of hit sequencing fragments and strongly supporting hit sequencing fragments includes:
  • the maximum value of the weighted sum of the number of hit sequencing fragments and the number of strongly supported hit sequencing fragments in the sequencing fragments is used as the breakpoint position.
  • the method further includes:
  • Sequences to be filtered in the targeted gene sequencing sequence are identified based on the number of bases, base quality and base length, and the sequences to be filtered are filtered.
  • identifying the linker sequence in the targeted gene sequencing sequence based on the number of bases, base quality and base length includes:
  • Sequencing sequences whose base quality is the quality threshold, the minimum base length is the base length threshold, and the average quality value of the sequencing sequence is lower than the quality threshold are used as sequences to be filtered;
  • a sequencing sequence whose left-end sequencing sequence or right-end sequencing sequence overlaps with the adapter sequence to a preset degree of overlap is added to the sequences to be filtered.
  • Some embodiments of the present disclosure provide a device for identifying fusion genes, which device includes:
  • an acquisition module configured to acquire the target gene sequencing sequence to be identified and the reference gene sequence
  • a comparison module configured to compare the targeted gene sequencing sequence with the reference gene sequence, and obtain the distribution of spanning sequencing fragments and hit sequencing fragments located in the targeted region in the targeted sequencing sequence and the target capture results;
  • a screening module configured to screen out target fusion gene pairs from the sequencing fragments based on the distribution and target capture results
  • an output module configured to output identification results regarding the target fusion gene pair.
  • the filtering module is also configured to:
  • Target fusion gene pairs are screened out from the sequencing fragments based on the breakpoint position and the position across the sequencing fragments.
  • the filtering module is also configured to:
  • the filtered and retained sequencing fragments are used as candidate fusion gene pairs;
  • the filtering module is also configured to:
  • the filtering module is also configured to:
  • the ratio of the distance between the two ends of the spanning sequencing fragment in the second candidate fusion gene pair and the breakpoint position to the length of the sequencing fragment is used as the second factor score;
  • the ratio of the distance between the two ends of the hit sequencing fragment in the second candidate fusion gene pair and the breakpoint position to the doubling degree of the sequencing fragment is used as the third factor score, where the doubling degree is the product of the length of the sequencing fragment and the doubling parameter;
  • the ratio of the sum of the first factor score, the second factor score and the third factor score to the average sequencing depth of the targeted region is used as the fusion gene score of the second candidate fusion gene pair.
  • the comparison module is also configured to:
  • the comparison module is also configured to:
  • the sum value obtained by summing the length of the left-end sequencing fragment, the length of the right-end sequencing fragment, and the distance between the left and right end sequencing fragments of the sequencing fragment is greater than the difference between the quartile of the length of the sequencing fragment and the target parameter.
  • the target parameter is a parameter that controls the number and stringency of output pairings;
  • the multiple alignment values of the left-end sequencing fragment and the right-end sequencing fragment in the sequencing fragments include the proper alignment feature value and the secondary alignment feature value, and do not include the unpaired feature value of the sequencing fragment or the unpaired feature value of the other sequencing fragment.
  • the comparison module is also configured to:
  • the alignment length of each position in the sequencing fragment is greater than a length threshold, and the alignment length is greater than one-third of the total length of the sequencing fragment;
  • the sequence of the alignment length within the sequencing fragment has no similar sequence in the alignment result;
  • the quality value of the number of comparisons of the sequencing fragments is greater than or equal to the number of comparisons
  • the filtering module is also configured to:
  • the maximum value of the weighted sum of the number of hit sequencing fragments and the number of strongly supported hit sequencing fragments in the sequencing fragments is used as the breakpoint position.
  • the acquisition module is also configured to:
  • the sequence to be filtered in the targeted gene sequencing sequence is identified according to the base number, base quality and base length, and the sequence to be filtered is filtered.
  • the acquisition module is also configured to:
  • Sequencing sequences whose base quality is the quality threshold, the minimum base length is the base length threshold, and the average quality value of the sequencing sequence is lower than the quality threshold are used as sequences to be filtered;
  • a sequencing sequence whose left-end sequencing sequence or right-end sequencing sequence overlaps with the adapter sequence to a preset degree of overlap is added to the sequences to be filtered.
  • Some embodiments of the present disclosure provide a computing processing device, including:
  • a memory having computer readable code stored therein;
  • one or more processors; when the computer readable code is executed by the one or more processors, the computing processing device performs the fusion gene identification method described above.
  • Some embodiments of the present disclosure provide a computer program, including computer readable code, which, when run on a computing processing device, causes the computing processing device to perform the identification method of a fusion gene as described above.
  • Some embodiments of the present disclosure provide a non-transitory computer-readable medium in which the method for identifying fusion genes as described above is stored.
  • Figure 1 schematically shows a flow chart of a method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 2 schematically shows one of the flow diagrams of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 3 schematically shows the second flow diagram of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 4 schematically shows the third flow chart of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 5 schematically shows one of the principle diagrams of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 6 schematically shows the fourth schematic flow chart of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 7 schematically shows the second schematic diagram of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 8 schematically shows the third schematic diagram of another method for identifying fusion genes provided by some embodiments of the present disclosure.
  • Figure 9 schematically shows the fourth schematic diagram of another method for identifying fusion genes provided by some embodiments of the present disclosure.
  • Figure 10 schematically shows the fifth schematic diagram of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 11 schematically shows the sixth schematic diagram of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 12 schematically shows one of the effect diagrams of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 13 schematically shows the second effect diagram of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 14 schematically shows the third effect diagram of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 15 schematically shows the fourth effect diagram of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 16 schematically shows the fifth effect diagram of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 17 schematically shows the sixth effect diagram of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 18 schematically shows the seventh effect diagram of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 19 schematically shows the eighth effect diagram of another method for identifying fusion genes provided by some embodiments of the present disclosure.
  • Figure 20 schematically shows the ninth effect diagram of another method for identifying fusion genes provided by some embodiments of the present disclosure
  • Figure 21 schematically shows a structural diagram of a fusion gene identification device provided by some embodiments of the present disclosure
  • Figure 22 schematically illustrates a block diagram of a computing processing device for performing methods according to some embodiments of the present disclosure
  • Figure 23 schematically illustrates a storage unit for holding or carrying program code implementing methods according to some embodiments of the present disclosure.
  • WGS and RNA-seq methods are high-throughput, but sequencing is expensive, and the large data volume makes storage difficult, consumes server resources during subsequent analysis, and lengthens analysis time.
  • As targeted sequencing gradually becomes a mainstream detection method in tumor diagnosis, early cancer screening, reproductive genetics and immunotherapy, it is necessary to establish tools, analysis methods and analysis workflows for identifying gene fusions from targeted sequencing data.
  • Figure 1 schematically shows a flow chart of a method for identifying fusion genes provided by the present disclosure.
  • the method includes:
  • Step 101 Obtain the targeted gene sequencing sequence to be identified and the reference gene sequence.
  • the targeted sequencing sequence to be identified is a gene sequencing sequence obtained by performing targeted sequencing on the sample genome in upstream experiments.
  • the targeted sequencing method can follow the targeted sequencing methods commonly used in the field. The sequencing method itself is not the focus of this disclosure and is not described further here.
  • the reference gene sequence is a genome sequence obtained through gene sequencing of a high-quality human genome.
  • low-quality data in the targeted sequencing sequence can be filtered through pre-processing methods.
  • the pre-processing methods can be, for example, excision of adapter sequences and filtering of low-quality sequences, which can be set according to actual needs and are not limited here.
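The low-quality filtering mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function name `keep_read` and the thresholds (mean quality 20, minimum length 50) are hypothetical and are not values given in the disclosure.

```python
from statistics import mean


def keep_read(qualities, min_mean_quality=20, min_length=50):
    """Return True when a read passes the length and mean-quality checks.

    `qualities` is a list of per-base quality scores for one read.
    Both thresholds are hypothetical defaults chosen for illustration.
    """
    if len(qualities) < min_length:
        return False
    return mean(qualities) >= min_mean_quality


# Toy reads: r1 passes, r2 has low mean quality, r3 is too short.
reads = {
    "r1": [30] * 100,
    "r2": [10] * 100,
    "r3": [35] * 20,
}
kept = [name for name, quals in reads.items() if keep_read(quals)]
```

In practice the thresholds would be tuned to the sequencing platform and the capture panel.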
  • Step 102 Compare the targeted gene sequencing sequence with the reference gene sequence to obtain the distribution of spanning sequencing fragments and hit sequencing fragments located in the targeted region in the targeted sequencing sequence and the target capture results.
  • a spanning sequencing fragment refers to a sequencing fragment that covers the fusion site and whose left-end sequencing fragment and right-end sequencing fragment can be aligned to different genes;
  • a hit sequencing fragment refers to a sequencing fragment that falls exactly on the fusion site.
  • the left-end sequencing fragment and the right-end sequencing fragment respectively refer to the fragments at the two opposite ends of the sequencing fragment.
  • the specific division into left-end and right-end sequencing fragments can be determined from the arrangement of the sequences in the sequencing fragment; with a different arrangement, the left and right directions also differ. The targeted gene sequencing sequence is aligned to the reference gene sequence.
  • Step 103 Screen out target fusion gene pairs from the sequencing fragments based on the distribution and target capture results.
  • the disclosure formulates screening rules based on the distribution characteristics of the spanning sequencing fragments and hit sequencing fragments of each sequencing fragment in the target region, such as the number of hit sequencing fragments and the proportion of bases, together with the positional characteristics of the spanning sequencing fragments and hit sequencing fragments, and screens each sequencing fragment in the target region accordingly.
  • Target fusion gene pairs that match the distribution characteristics of fusion gene pairs in spanning sequencing fragments and hit sequencing fragments are screened out. In this way, different screening conditions can be formulated for different identification needs. There is no need to develop a dedicated targeted sequencing method for a specific fusion gene type, and the identified fusion genes are no longer limited to the fusion gene types targeted by the targeted sequencing method, which improves the utilization of targeted sequencing data.
  • Step 104 Output the identification results regarding the target fusion gene pair.
  • the target fusion gene pairs in the targeted sequencing results can be processed through a visualization module.
  • the visualization module can be, for example, a program such as IGV or Read Map that is used for visual output of gene sequencing data.
  • data collected during the fusion gene identification process, such as the distribution of spanning sequencing fragments and hit sequencing fragments and the target capture results, can also be combined and displayed as identification results so that users can check and correct the identification results.
  • the embodiments of the present disclosure perform fusion gene identification on conventional targeted sequencing data: the targeted sequencing sequence is compared with the reference gene sequence to obtain the distribution of hit sequencing fragments and spanning sequencing fragments near the targeted region and the target capture results, and the distribution characteristics of fusion gene pairs in hit sequencing fragments and spanning sequencing fragments are used to screen out the fusion gene pairs in the targeted gene sequencing sequence. Because the distributions of sequencing fragments of different fusion genes can be screened, the fusion gene identification results are no longer limited to specific fusion gene types, which improves the utilization of targeted gene sequencing data in gene fusion identification.
  • step 103 includes:
  • Step 1031 Calculate the breakpoint position of the sequencing fragment based on the number of hit sequencing fragments and strongly supporting hit sequencing fragments.
  • a strongly supported hit sequencing fragment is a hit sequencing fragment in which the alignment positions of the left-end sequencing fragment and the right-end sequencing fragment overlap on the genome.
  • the breakpoint position of the sequencing fragment refers to the position of the gene breakpoint of the fusion gene due to gene translocation, substitution, etc.
  • the breakpoint position can be located based on the number of identified hit sequencing fragments and strongly supporting hit sequencing fragments.
  • Step 1032 Screen out target fusion gene pairs from the sequencing fragments based on the breakpoint position and the position of the spanning sequencing fragments.
  • because sequencing fragments with breakpoints are not necessarily fusion gene pairs, fusion gene pairs must be further screened out from the sequencing fragments with breakpoints based on the breakpoint positions and the position distribution of the spanning sequencing fragments.
  • step 1032 includes:
  • Step 10321 Filter out the spanning sequencing fragments that have no support upstream and downstream of their breakpoint position.
  • if no support exists upstream and downstream of the breakpoint position, the sequencing fragment is discarded; if support exists, the breakpoint position is considered reliable and the sequencing fragment is retained for subsequent analysis.
  • Step 10322 if the first end and the second end of the filtered and retained sequencing fragments are located in different genes, use the filtered and retained sequencing fragments as a candidate fusion gene pair.
  • the first end and the second end refer to the two opposite ends of the sequencing fragment.
  • the sequenced fragments are used as candidate fusion gene pairs.
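Step 10322 can be illustrated with a small sketch that maps each end of a retained fragment to a gene and keeps the fragment only when the two ends fall in different genes. The gene names, coordinates and the helper names `gene_at` and `candidate_pair` are hypothetical placeholders, not data or identifiers from the disclosure.

```python
def gene_at(chrom, pos, gene_intervals):
    """Return the gene whose interval contains (chrom, pos), or None."""
    for gene, (g_chrom, start, end) in gene_intervals.items():
        if g_chrom == chrom and start <= pos <= end:
            return gene
    return None


def candidate_pair(first_end, second_end, gene_intervals):
    """Return (gene_a, gene_b) when the two ends lie in different genes."""
    gene_a = gene_at(first_end[0], first_end[1], gene_intervals)
    gene_b = gene_at(second_end[0], second_end[1], gene_intervals)
    if gene_a is None or gene_b is None or gene_a == gene_b:
        return None
    return (gene_a, gene_b)


# Placeholder annotation: gene -> (chromosome, start, end).
genes = {
    "GENE_A": ("chr2", 100, 200),
    "GENE_B": ("chr2", 900, 1000),
}
pair = candidate_pair(("chr2", 150), ("chr2", 950), genes)
same_gene = candidate_pair(("chr2", 150), ("chr2", 180), genes)
```

A linear scan is enough for a small capture panel; a larger annotation would normally use an interval index instead.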
  • Step 10323 Filter low-quality fusion gene pairs among the candidate fusion gene pairs to obtain the target fusion gene pair.
  • the identified candidate fusion gene pairs are further screened according to quality assessment standards to obtain target fusion gene pairs, so as to ensure the quality of the output target fusion gene pairs.
  • the quality assessment standards can be formulated through parameters such as quality score, credibility score, data accuracy, etc.
  • the specific settings can be set according to actual needs and are not limited here.
  • step 10323 includes:
  • Step 103231 Filter the paralogous genes in the candidate fusion gene pair to obtain the first candidate fusion gene pair.
  • paralogous genes refer to genes derived from gene duplication in the same species, which may evolve new but related functions to the original function.
  • that is, whether the genes of a candidate fusion gene pair are paralogous genes is identified. If they are, the candidate fusion gene pair is filtered out and not considered a fusion gene pair; if they are not, the pair is retained as a first candidate fusion gene pair for further filtering.
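The paralog filter of Step 103231 can be sketched as a lookup against known paralog groups. The sketch assumes such groups are available from an external annotation; the group contents and the name `drop_paralogs` are hypothetical.

```python
def drop_paralogs(candidate_pairs, paralog_groups):
    """Remove candidate pairs whose two genes belong to the same paralog group."""
    group_of = {
        gene: index
        for index, group in enumerate(paralog_groups)
        for gene in group
    }
    kept = []
    for gene_a, gene_b in candidate_pairs:
        group_a = group_of.get(gene_a)
        group_b = group_of.get(gene_b)
        if group_a is not None and group_a == group_b:
            continue  # paralogous pair: not treated as a fusion
        kept.append((gene_a, gene_b))
    return kept


# Hypothetical paralog annotation and candidate pairs.
paralogs = [{"GENE_A1", "GENE_A2"}, {"GENE_B1", "GENE_B2"}]
candidates = [("GENE_A1", "GENE_A2"), ("GENE_A1", "GENE_B1"), ("GENE_C", "GENE_D")]
first_candidates = drop_paralogs(candidates, paralogs)
```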
  • Step 103232 Calculate the number of gene pairs included in the first candidate fusion gene pair.
  • Step 103233 Filter the first candidate fusion gene pairs whose number of gene pairs is greater than or equal to the gene pairing number threshold to obtain the second candidate fusion gene pair.
  • the number of gene pairings of the first candidate fusion gene pair is calculated. Referring to Figure 5, if geneA pairs with geneB, geneC and geneD at the same time, the combination is filtered out and not considered a fusion gene; otherwise it is used as a second candidate fusion gene pair. This is only an illustrative example, in which the threshold for the number of gene pairings is 3. The threshold can also be another positive integer greater than 1, set according to actual needs, and is not limited here.
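The pairing-count filter of Steps 103232 and 103233 can be sketched as below, using the geneA example and a threshold of 3. Counting partners for a gene regardless of which side of the pair it appears on is an assumption of this sketch, and `drop_promiscuous` is a hypothetical name.

```python
from collections import defaultdict


def drop_promiscuous(pairs, pairing_threshold=3):
    """Drop pairs in which either gene has >= pairing_threshold distinct partners."""
    partners = defaultdict(set)
    for gene_a, gene_b in pairs:
        partners[gene_a].add(gene_b)
        partners[gene_b].add(gene_a)
    return [
        (gene_a, gene_b)
        for gene_a, gene_b in pairs
        if len(partners[gene_a]) < pairing_threshold
        and len(partners[gene_b]) < pairing_threshold
    ]


first_candidates = [
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneA", "geneD"),
    ("geneE", "geneF"),
]
second_candidates = drop_promiscuous(first_candidates)
```

Here geneA has three partners, so all of its pairings are removed and only the geneE and geneF pair is kept.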
  • Step 103234 Calculate the fusion gene score of the second candidate fusion gene pair based on the distance between the breakpoint positions of the second candidate fusion gene pair and the average sequencing depth of the targeted region.
  • Step 103235 Filter out the second candidate fusion gene pairs whose fusion gene score is less than the fusion gene score threshold to obtain the target fusion gene pairs.
  • the credibility of a fusion gene pair can be measured by a fusion gene score computed with a formula based on the distance between adjacent breakpoint positions and the average sequencing depth of the targeted region: the higher the fusion gene score, the higher the credibility of the fusion gene. The fusion gene score threshold can be adjusted through parameters; the larger the threshold, the higher the credibility of the retained fusion gene pairs. The specific value can be set according to actual needs and is not limited here.
  • step 103234 may include:
  • N2 Use the ratio of the distance between the two ends of the spanning sequencing fragment in the second candidate fusion gene pair and the breakpoint position to the length of the sequencing fragment as the second factor score;
  • N3 Use the ratio of the distance between the two ends of the hit sequencing fragment in the second candidate fusion gene pair and the breakpoint position to the doubling length of the sequencing fragment as the third factor score, where the doubling length is the product of the length of the sequencing fragment and the doubling parameter;
  • N4 Use the ratio of the sum of the first factor score, the second factor score and the third factor score to the average sequencing depth of the targeted region as the fusion gene score of the second candidate fusion gene pair.
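The final combination in N4 and the threshold filter of Step 103235 can be sketched as follows. The three factor scores are taken as given inputs, since this sketch does not attempt to reproduce how each factor is derived; the function names and the numeric values are hypothetical.

```python
def fusion_score(first_factor, second_factor, third_factor, mean_depth):
    """Sum of the three factor scores divided by the mean depth of the targeted region."""
    if mean_depth <= 0:
        raise ValueError("mean_depth must be positive")
    return (first_factor + second_factor + third_factor) / mean_depth


def filter_by_score(scored_pairs, score_threshold):
    """Keep the pairs whose fusion gene score is not below the threshold."""
    return [pair for pair, score in scored_pairs if score >= score_threshold]


# Hypothetical factor scores and depth for two candidate pairs.
scored = [
    (("GENE_A", "GENE_B"), fusion_score(1.0, 2.0, 3.0, 200.0)),
    (("GENE_C", "GENE_D"), fusion_score(0.5, 0.5, 1.0, 400.0)),
]
targets = filter_by_score(scored, score_threshold=0.01)
```

Dividing by the mean depth normalizes the score so that deeply sequenced samples do not automatically score higher.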
  • step 102 includes:
  • Step 1021 Compare the targeted gene sequencing sequence and the reference gene sequence to obtain the comparison result.
  • Step 1022 Screen out the spanning sequencing fragments in the targeted sequencing results from the comparison results based on the spanning fragment screening conditions, and screen out the hit sequencing fragments and strongly supported hit sequencing fragments in the targeted sequencing results from the comparison results based on the hit fragment screening conditions.
  • parameter indicators are collected for each sequencing fragment in the comparison result, such as the lengths of its left-end and right-end sequencing fragments and the distance between the left-end and right-end sequencing fragments on the genome. Each sequencing fragment is then screened against these parameter indicators using the spanning fragment screening conditions and the hit fragment screening conditions to determine the spanning sequencing fragments, the hit sequencing fragments, and the strongly supported hit sequencing fragments among the hit sequencing fragments in the targeted sequencing results.
  • step 1022 includes the following S1 to S3:
  • the sum of the length of the left-end sequencing fragment, the length of the right-end sequencing fragment and the distance between the left-end and right-end sequencing fragments is greater than the product of the lower quartile of the sequencing fragment length and the target parameter, where the target parameter controls the number and stringency of output pairings. This can be written as formula (1): $L_1 + L_2 + d > \mathrm{Insert}_d \times C$, where:
  • $d$ represents the distance on the genome between the left-end sequencing fragment R1 and the right-end sequencing fragment R2 of the sequencing fragment
  • $L_1$ represents the length of the left-end sequencing fragment R1
  • $L_2$ represents the length of the right-end sequencing fragment R2
  • $\mathrm{Insert}_d$ represents the lower quartile of the length of the sequencing fragment
  • $C$ represents the parameter that controls the number and stringency of output pairings. It can be adjusted according to specific needs and can be a positive integer ranging from 10 to 100.
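The length condition for spanning sequencing fragments can be sketched as below. The sketch reads the condition as a comparison against the product of the lower quartile and C, following the wording of this passage; the helper names and the toy fragment lengths are hypothetical, and the inclusive quartile method is just one possible choice.

```python
from statistics import quantiles


def lower_quartile(fragment_lengths):
    """Lower quartile (Q1) of the observed sequencing fragment lengths."""
    return quantiles(fragment_lengths, n=4, method="inclusive")[0]


def passes_span_length(l1, l2, d, insert_q1, c):
    """True when L1 + L2 + d exceeds the lower quartile times the parameter C."""
    return l1 + l2 + d > insert_q1 * c


# Toy fragment lengths; C = 10 is the low end of the range stated above.
q1 = lower_quartile([100, 200, 300, 400, 500])
far_apart = passes_span_length(100, 100, 5000, q1, 10)
close_together = passes_span_length(100, 100, 150, q1, 10)
```

A larger C demands a larger separation between the two ends, so fewer fragments are reported as spanning.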
  • similar sequences refer to sequences in which the homology comparison result of two sequencing sequences is greater than the homology comparison result threshold.
  • the homology comparison result threshold can be set according to actual needs and is not limited here. Referring to Figure 7, neither the left-end nor the right-end sequencing fragment of a spanning sequencing fragment has multiple similar sequences on the genome, that is, there is no similar sequence whose homology comparison result is greater than the comparison result threshold, for example 5, 10 or 15.
  • the multiple alignment values of the left-end sequencing fragment and the right-end sequencing fragment in the sequencing fragments include the proper alignment feature value and the secondary alignment feature value, and do not include the unpaired feature value of the sequencing fragment or the unpaired feature value of the other sequencing fragment.
  • the multiple alignment values of either the left-end or the right-end sequencing fragment of a spanning sequencing fragment include only the proper alignment feature value (proper aligner) and the secondary alignment feature value (secondary alignment), and do not include the unpaired feature value of the sequencing segment (segment unmapped) or the unpaired feature value of the other sequencing segment (next segment unmapped). It should be noted that the unpaired feature value of the other sequencing segment indicates that the segment paired with the current segment, that is, its previous or next paired fragment, is unmapped.
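The feature values named above resemble the bitwise FLAG field of SAM alignment records, and under that assumption the check can be sketched as a bit test. Interpreting the feature values as SAM flags is an assumption of this sketch, not something the disclosure states. The bit values themselves (0x2, 0x4, 0x8, 0x100) come from the SAM format; the function name `flag_ok` is hypothetical.

```python
PROPER_PAIR = 0x2      # each segment properly aligned according to the aligner
UNMAPPED = 0x4         # segment unmapped
MATE_UNMAPPED = 0x8    # next segment in the template unmapped
SECONDARY = 0x100      # secondary alignment


def flag_ok(flag):
    """Accept a record that is properly aligned and has no unmapped bit set.

    The secondary-alignment bit is permitted but not required.
    """
    if flag & (UNMAPPED | MATE_UNMAPPED):
        return False
    return bool(flag & PROPER_PAIR)


# 99 = paired + proper pair + mate reverse + first in pair
# 355 = 99 + secondary alignment
# 73 = paired + mate unmapped + first in pair
# 69 = paired + segment unmapped + first in pair
results = {flag: flag_ok(flag) for flag in (99, 355, 73, 69)}
```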
  • the present disclosure filters hit sequencing fragments by setting spanning fragment screening conditions, which can efficiently screen spanning sequencing fragments from sequencing fragments and improve the efficiency of fusion gene identification.
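The three spanning-fragment conditions can be sketched as a single predicate. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the function name, its arguments and the pre-computed similar-sequence counts are hypothetical, the length condition is read as d + L1 + L2 > Insert d × C, and the flag test uses the standard SAM flag bits for "segment unmapped" and "next segment unmapped".

```python
# SAM flag bits as defined in the SAM format specification.
SEGMENT_UNMAPPED = 0x4
NEXT_SEGMENT_UNMAPPED = 0x8


def is_spanning_candidate(d, l1, l2, insert_q1, c, flag,
                          similar_r1=0, similar_r2=0):
    """Return True if a read pair passes the three spanning conditions.

    d          -- genomic distance between R1 and R2
    l1, l2     -- lengths of R1 and R2
    insert_q1  -- lower quartile of the fragment length (Insert d)
    c          -- stringency parameter C (10 to 100 in the text)
    flag       -- SAM FLAG of the read
    similar_*  -- hypothetical counts of similar genomic sequences
    """
    # Condition 1: d + L1 + L2 > Insert d * C
    long_enough = d + l1 + l2 > insert_q1 * c
    # Condition 2: neither end has similar sequences on the genome
    unique = similar_r1 == 0 and similar_r2 == 0
    # Condition 3: neither the read nor its mate is unmapped
    mapped = not (flag & (SEGMENT_UNMAPPED | NEXT_SEGMENT_UNMAPPED))
    return long_enough and unique and mapped
```

With C at its lowest value of 10, a pair only passes when its ends lie roughly ten insert lengths apart, which is why the parameter controls both the number and the stringency of the output pairings.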
  • the alignment length of each position in the sequencing fragment is greater than the length threshold, and the alignment length is greater than one-third of the total length of the sequencing fragment;
  • the hit sequencing fragment needs to satisfy the following formula (2): N > 20 and N > L/3
  • N represents the alignment length of each position in the sequencing fragment
  • 20 is the length threshold
  • L represents the total length of the sequencing fragment.
  • formula (2) here is only illustrative. The specific N and L can be set according to actual needs and are not limited here.
  • the aligned sequence of the sequencing fragment, whose length is the alignment length, has no similar sequence in the alignment result;
  • referring to Figure 8, the aligned sequence of length N does not have multiple similar sequences on the genome; that is, no similar sequence has a homology comparison result above the threshold, which may be, for example, 5, 10, or 15.
  • the alignment quality value of the sequencing fragment is greater than or equal to the alignment quality threshold;
  • for example, the alignment quality value Q is greater than or equal to 30, and the corresponding number of alignments is greater than 1.
  • the disclosure uses set hit fragment screening conditions to screen hit sequencing fragments, which can efficiently screen out hit sequencing fragments from the sequencing fragments and improve the efficiency of fusion gene identification.
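The hit-fragment conditions, together with the strong-support rule (the left-end and right-end alignments overlap on the genome), can be sketched as follows. The function names, the half-open interval convention and the pre-computed similar-sequence count are assumptions made for illustration; the thresholds 20, L/3 and 30 are the example values given in the text.

```python
def is_hit_fragment(aligned_len, read_len, mapq, n_similar=0,
                    min_len=20, min_mapq=30):
    """Check the hit (split read) conditions for one read.

    aligned_len -- alignment length N
    read_len    -- total read length L
    mapq        -- alignment quality value Q
    n_similar   -- hypothetical count of similar genomic sequences
    """
    # Formula (2): N > 20 and N > L / 3
    length_ok = aligned_len > min_len and aligned_len > read_len / 3
    return length_ok and n_similar == 0 and mapq >= min_mapq


def is_strongly_supported(r1, r2):
    """True if the R1 and R2 alignments overlap.

    r1, r2 -- (start, end) genomic intervals, half-open.
    """
    return r1[0] < r2[1] and r2[0] < r1[1]
```

A hit fragment that also satisfies `is_strongly_supported` would be counted in the strongly supported class when breakpoints are scored.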
  • step 1031 includes: taking the maximum value of the weighted sum of the number of hit sequencing fragments and the number of strongly supported hit sequencing fragments in the sequencing fragments as the breakpoint position.
  • because the breakpoint position is usually related to the numbers of hit sequencing fragments and strongly supported hit sequencing fragments, and a given sequencing fragment may have several of each, the breakpoint position can be determined by assigning different weights to the two counts and taking the maximum of their weighted sum.
  • the breakpoint position of the sequencing fragment can be calculated through the following formula (3): Ii = Max(n*bi + m*Bi)
  • Ii represents the breakpoint position of the i-th sequencing fragment
  • bi represents the number of hit sequencing fragments in the i-th sequencing fragment
  • Bi represents the number of strongly supported sequencing fragments in the i-th sequencing fragment
  • n represents the weight value of the hit sequencing fragments, and m represents the weight value of the strongly supported hit sequencing fragments.
  • for example, when the weight value n of the hit sequencing fragments is 0.8 and the weight m of the strongly supported hit sequencing fragments is 2.5, the breakpoint position is Max(0.8bi+2.5Bi); when n is 0.6 and m is 3, the breakpoint position is Max(0.6bi+3Bi); when n is 7 and m is 10, the breakpoint position is Max(7bi+10Bi).
  • the weight values of hit sequencing fragments and strongly supporting hit sequencing fragments can be set according to actual needs and are not limited here.
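One way to read formula (3) is as an arg-max over candidate positions: each position gets the weighted sum n*b + m*B of its hit and strongly supported hit counts, and the position with the largest sum is called the breakpoint. The sketch below follows that reading; the function name, the dictionary layout and the toy counts are hypothetical, and the default weights are the first example pair from the text.

```python
def call_breakpoint(counts, n=0.8, m=2.5):
    """Pick the position with the largest weighted support.

    counts -- dict mapping position -> (b, B), where b is the number
              of hit fragments and B the number of strongly supported
              hit fragments at that position.
    """
    def weighted(pos):
        b, strong = counts[pos]
        return n * b + m * strong

    return max(counts, key=weighted)


# Toy counts, not data from the disclosure.
toy_counts = {100: (5, 1), 200: (2, 4), 300: (6, 0)}
best = call_breakpoint(toy_counts)
```

Because m is larger than n in every example pair, a position with a few strongly supported fragments can outrank one with more ordinary hit fragments.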
  • the method further includes:
  • Step 201 Count the number of bases, base quality and base length in the obtained target gene sequencing sequence.
  • to identify the adapter sequence, a certain number of lines, for example the first 10,000 or 15,000 lines, of the left-end sequencing sequences in the targeted gene sequencing sequence can be retrieved and searched against the known sequences of each sequencing platform. This identifies the proportion of each type of adapter sequence in the sequencing sequences, and thereby determines the adapter sequence used for sequencing and its proportion.
  • Step 202 Identify sequences to be filtered in the targeted gene sequencing sequence based on the number of bases, base quality and base length, and filter the sequences to be filtered.
  • the adapter sequence can be further identified based on the number of bases, base quality and base length in the sequences, so that adapter sequences that affect the subsequent identification process are filtered out. This ensures the quality of the input data for subsequent fusion gene identification and improves its accuracy.
  • step 202 includes:
  • Step 2021 use the sequencing sequences whose base quality is the quality threshold, the minimum base length is the base length threshold, and the average quality value of the sequencing sequence is lower than the quality threshold as sequences to be filtered.
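  • Step 2022 Add sequencing sequences whose left-end or right-end sequencing sequence overlaps the adapter sequence to a preset degree of overlap to the sequences to be filtered.
  • the data filtering standard may be to treat as adapter sequences those sequencing sequences whose base quality equals the base quality threshold, whose minimum base length equals the base length threshold, and whose maximum sequencing error rate equals the error rate threshold. Sequencing sequences whose left-end and right-end sequencing sequences overlap over a region at least as long as the preset degree of overlap, for example 3 bp, may also be treated as adapter sequences and trimmed, to ensure the quality of the input data for subsequent fusion gene identification and improve its accuracy.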
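A minimal sketch of this kind of read filtering is shown below. It is not the disclosed filter: the function names, the Phred+33 quality encoding, and the thresholds (minimum length 50, mean quality 20) are assumptions chosen for illustration. The adapter string is the Illumina adapter reported in the examples, and the 3 bp minimum overlap follows the example value in the text.

```python
def mean_phred(qual, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def keep_read(seq, qual, min_len=50, min_mean_q=20.0):
    """Keep a read only if it is long enough and of sufficient quality."""
    return len(seq) >= min_len and mean_phred(qual) >= min_mean_q


def trim_adapter(seq, adapter="AGATCGGAAGAGC", min_overlap=3):
    """Cut the adapter from the 3' end of a read.

    A full adapter match anywhere in the read is cut first; otherwise
    the longest adapter prefix of at least min_overlap bases that ends
    the read is removed.
    """
    idx = seq.find(adapter)
    if idx != -1:
        return seq[:idx]
    for k in range(min(len(adapter), len(seq)) - 1, min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:-k]
    return seq
```

In practice, trimming would normally run before `keep_read`, so that reads shortened below the length threshold by adapter removal are also discarded.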
  • the embodiments of this disclosure provide two examples of applying the above identification methods of fusion genes to specific scenarios for reference.
  • the adapter was detected to be the Illumina sequencing platform adapter ‘AGATCGGAAGAGC’, and 2.7% of the reads contained this adapter.
  • the filtered data statistics are as follows:
  • Figure 13 shows sample 1
  • Figure 14 shows sample 2
  • Figure 15 shows sample 3.
  • the insert fragment sizes of the three samples are 359, 356, and 374 respectively.
  • the C value parameter in formula (1) is preset to 10.
  • the identified spanning read pair information includes at least: paired read ID, the read's alignment flag value, reference sequence name, aligned chromosome position, alignment quality value, alignment match status (CIGAR string), name of the aligned reference (chromosome), position of the first matched base, library insert size, sequence fragment, and sequence fragment quality value.
  • the breakpoint visualization can be seen in Figure 16, where the breakpoint was found on the ERG gene, located between 21q22.2, 38,528,440bp and 38,528,750bp.
  • the top, middle and bottom are sample 1, sample 2, and sample 3 in order; refer to Figure 17, where a breakpoint was found on the TMPRSS2 gene, located between 21q22.3, 41,507,900 and 41,508,300.
  • the top, middle and bottom are sample 1, sample 2 and sample 3.
  • Example 2 Targeted sequencing of the exome of bladder cancer identifies gene fusions:
  • the adapter was detected to be the Illumina sequencing platform adapter ‘AGATCGGAAGAGC’, and 13.25% of the reads contained this adapter.
  • the filtered data statistics are as follows:
  • Figure 21 schematically shows a structural diagram of a fusion gene identification device 30 provided by the present disclosure.
  • the device includes:
  • the acquisition module 301 is configured to acquire the target gene sequencing sequence to be identified and the reference gene sequence;
  • a comparison module 302 configured to compare the targeted gene sequencing sequence with the reference gene sequence, and obtain the distribution and targeted capture results of the spanning sequencing fragments and hit sequencing fragments located in the targeted region in the targeted sequencing sequence;
  • the screening module 303 is configured to screen out target fusion gene pairs from the sequencing fragments based on the distribution and target capture results;
  • the output module 304 is configured to output identification results regarding the target fusion gene pair.
  • the screening module 303 is also configured to:
  • Target fusion gene pairs are screened out from the sequencing fragments based on the breakpoint position and the position across the sequencing fragments.
  • the screening module 303 is also configured to:
  • the filtered and retained sequencing fragments are used as candidate fusion gene pairs;
  • the screening module 303 is also configured to:
  • the screening module 303 is also configured to:
  • the ratio between the distance between the two ends of the spanning sequencing fragment in the second fusion gene pair and the breakpoint position and the length of the sequencing fragment is used as the second factor score;
  • the ratio between the distance between the two ends of the hit sequencing fragment in the second fusion gene pair and the breakpoint position and the doubling degree of the sequencing fragment is used as the third factor score, where the doubling degree is The product of the length of the sequencing fragment and the doubling parameter;
  • the ratio between the sum of the first factor score, the second factor score, and the third factor score and the average sequencing depth of the target region is used as the fusion gene score of the second candidate fusion gene pair .
  • the comparison module 302 is also configured to:
  • the comparison module 302 is also configured to:
  • the sum of the length of the left-end sequencing fragment, the length of the right-end sequencing fragment, and the distance between the left-end and right-end sequencing fragments is greater than the product of the lower quartile of the sequencing fragment length and the target parameter.
  • the target parameter is a parameter that controls the number and stringency of output pairings;
  • the multiple alignment values of the left-end sequencing fragment and the right-end sequencing fragment in the sequencing fragments include the positive alignment feature value and the secondary alignment feature value, and do not include the unpaired feature value of the sequencing fragment and the unpaired feature value of the other sequencing fragment. .
  • the comparison module 302 is also configured to:
  • the alignment length of each position in the sequencing fragment is greater than a length threshold, and the alignment length is greater than one-third of the total length of the sequencing fragment;
  • the length of the sequencing fragment is the sequence of the alignment length and there is no similar sequence in the alignment result
  • the quality value of the number of comparisons of the sequencing fragments is greater than or equal to the number of comparisons
  • the screening module 303 is also configured to:
  • the maximum value of the weighted sum of the number of hit sequencing fragments and the number of strongly supported hit sequencing fragments in the sequencing fragments is used as the breakpoint position.
  • the acquisition module 301 is also configured to:
  • Sequences to be filtered in the targeted gene sequencing sequence are identified based on the number of bases, base quality and base length, and the sequences to be filtered are filtered.
  • the acquisition module 301 is also configured to:
  • Sequencing sequences whose base quality is the quality threshold, the minimum base length is the base length threshold, and the average quality value of the sequencing sequence is lower than the quality threshold are used as sequences to be filtered;
  • sequence sequence containing the left-end sequencing sequence or the right-end sequencing sequence that overlaps with the adapter sequence to a preset degree of overlap is added to the sequence to be filtered.
  • the embodiments of the present disclosure perform fusion gene identification on conventional targeted sequencing: the reference gene sequence is aligned to the targeted region of targeted sequencing to obtain the distribution and targeted capture results of hit sequencing fragments and spanning sequencing fragments near the targeted region, and the distribution characteristics of hit and spanning sequencing fragments in fusion gene pairs are used to screen out the fusion gene pairs in the targeted gene sequencing sequence. Screening can be performed on the distribution of sequencing fragments in different fusion genes, so the fusion gene identification results of targeted gene sequencing sequences are no longer limited to specific fusion gene types, which improves the utilization of targeted gene sequencing data in gene fusion identification.
  • Various component embodiments of the present disclosure may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof.
  • a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all functions of some or all components in a computing processing device according to embodiments of the present disclosure.
  • the present disclosure may also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing part or all of the methods described herein.
  • Such a program implementing the present disclosure may be stored on a non-transitory computer-readable medium, or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, or provided on a carrier signal, or in any other form.
  • Figure 22 illustrates a computing processing device that may implement methods in accordance with the present disclosure.
  • the computing processing device conventionally includes a processor 410 and a computer program product in the form of memory 420 or non-transitory computer-readable medium.
  • Memory 420 may be electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • the memory 420 has a storage space 430 for program code 431 for executing any method steps in the above-described methods.
  • the storage space 430 for program codes may include individual program codes 431 respectively used to implement various steps in the above method. These program codes can be read from or written into one or more computer program products.
  • These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks. Such computer program products are typically portable or fixed storage units as described with reference to Figure 23.
  • the storage unit may have storage segments, storage spaces, etc. arranged similarly to the memory 420 in the computing processing device of FIG. 22 .
  • the program code may, for example, be compressed in a suitable form.
  • the storage unit includes computer readable code 431', i.e. code that can be read by, for example, a processor such as 410, which, when executed by a computing processing device, causes the computing processing device to perform the various steps of the methods described above.
  • any reference signs placed between parentheses shall not be construed as limiting the claim.
  • the word “comprising” does not exclude the presence of elements or steps not listed in a claim.
  • the word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements.
  • the present disclosure may be implemented by means of hardware comprising several different elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by the same item of hardware.
  • the use of the words first, second, third, etc. does not indicate any order. These words can be interpreted as names.

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

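The fusion gene identification method, device, equipment, program and storage medium provided by the present disclosure belong to the field of gene detection technology. The method includes: obtaining a targeted gene sequencing sequence to be identified and a reference gene sequence; comparing the targeted gene sequencing sequence with the reference gene sequence to obtain the distribution and targeted capture results of the spanning sequencing fragments and hit sequencing fragments located in the targeted region of the targeted sequencing sequence; screening out target fusion gene pairs from the sequencing fragments based on the distribution and targeted capture results; and outputting identification results regarding the target fusion gene pairs.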

Description

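Fusion gene identification method, device, equipment, program and storage medium
Technical Field
The present disclosure belongs to the field of gene detection technology, and particularly relates to a fusion gene identification method, device, equipment, program and storage medium.
Background
Gene fusion is a process in which chromosomal translocation, deletion or inversion causes all or part of the sequences of two unrelated genes to fuse with each other into a new gene. Tens of thousands of gene fusions have been discovered. Many gene fusions have been reported to be closely related to the occurrence of cancer; among them, common fusion genes such as ALK (Anaplastic Lymphoma Kinase), ROS1 (ROS proto-oncogene 1, receptor tyrosine kinase) and NTRK (NeuroTrophin Receptor Kinase) serve as diagnostic tools for certain cancers. According to recent research reports, more than 1,000 gene fusions have been identified, and tumor driver gene fusions among them have become a research hotspot.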
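Summary
The present disclosure provides a fusion gene identification method, device, equipment, program and storage medium. The method includes:
obtaining a targeted gene sequencing sequence to be identified and a reference gene sequence;
comparing the targeted gene sequencing sequence with the reference gene sequence to obtain the distribution and targeted capture results of the spanning sequencing fragments and hit sequencing fragments located in the targeted region of the targeted sequencing sequence;
screening out target fusion gene pairs from the sequencing fragments based on the distribution and targeted capture results;
outputting identification results regarding the target fusion gene pairs.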
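Optionally, screening out target fusion gene pairs from the sequencing fragments based on the distribution and targeted capture results includes:
calculating the breakpoint position of the sequencing fragments according to the numbers of hit sequencing fragments and strongly supported hit sequencing fragments;
screening out target fusion gene pairs from the sequencing fragments according to the breakpoint position and the positions of the spanning sequencing fragments.
Optionally, screening out target fusion gene pairs from the sequencing fragments according to the breakpoint position and the positions of the spanning sequencing fragments includes:
filtering out sequencing fragments for which no supporting spanning sequencing fragment exists upstream or downstream of the breakpoint position they contain;
when the first end and the second end of a sequencing fragment retained after filtering are located in different genes, taking the retained sequencing fragment as a candidate fusion gene pair;
filtering out low-quality fusion gene pairs from the candidate fusion gene pairs to obtain the target fusion gene pairs.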
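Optionally, filtering out low-quality fusion gene pairs from the candidate fusion gene pairs to obtain the target fusion gene pairs includes:
filtering out paralogous genes from the candidate fusion gene pairs to obtain first candidate fusion gene pairs;
calculating the number of gene pairings contained in the first candidate fusion gene pairs;
filtering out first candidate fusion gene pairs whose number of gene pairings is greater than or equal to a gene pairing number threshold to obtain second candidate fusion gene pairs;
calculating the fusion gene score of the second candidate fusion gene pair according to the distance between the breakpoint positions in the second candidate fusion gene pair and the average sequencing depth of the targeted region;
filtering out second candidate fusion gene pairs whose fusion gene score is less than a fusion gene score threshold to obtain the target candidate fusion gene pairs.
Optionally, calculating the fusion gene score of the second candidate fusion gene pair according to the distance between the breakpoint positions in the second candidate fusion gene pair and the average sequencing depth of the targeted region includes:
taking the difference between the sum of the distances from the spanning sequencing fragment in the second fusion gene pair to the two breakpoints and the peak insert fragment length of the genome methylation sequencing sequence as the first factor score;
taking the ratio of the distance from the two ends of the spanning sequencing fragment in the second fusion gene pair to the breakpoint position to the length of the sequencing fragment as the second factor score;
taking the ratio of the distance from the two ends of the hit sequencing fragment in the second fusion gene pair to the breakpoint position to the doubling length of the sequencing fragment as the third factor score, where the doubling length is the product of the length of the sequencing fragment and a doubling parameter;
taking the ratio of the sum of the first factor score, the second factor score and the third factor score to the average sequencing depth of the targeted region as the fusion gene score of the second candidate fusion gene pair.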
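Optionally, comparing the targeted gene sequencing sequence with the reference gene sequence to obtain the distribution and targeted capture results of the spanning sequencing fragments and hit sequencing fragments located in the targeted region of the targeted sequencing sequence includes:
comparing the targeted gene sequencing sequence with the reference gene sequence to obtain an alignment result;
screening out the spanning sequencing fragments in the targeted sequencing result from the alignment result based on spanning fragment screening conditions, and screening out the hit sequencing fragments and strongly supported hit sequencing fragments of the targeted sequencing result from the alignment result based on hit fragment screening conditions.
Optionally, screening out the spanning sequencing fragments in the targeted sequencing result from the alignment result based on the spanning fragment screening conditions includes:
screening out, from the sequencing fragments, spanning sequencing fragments that simultaneously meet the following spanning fragment screening conditions:
the sum of the left-end sequencing fragment length, the right-end sequencing fragment length and the distance between the left-end and right-end sequencing fragments is greater than the product of the lower quartile of the sequencing fragment length and a target parameter, where the target parameter is a parameter that controls the number and stringency of output pairings;
neither the left-end sequencing fragment nor the right-end sequencing fragment of the sequencing fragment has similar sequences;
the multiple alignment values of the left-end and right-end sequencing fragments include the proper alignment feature value and the secondary alignment feature value, and do not include the segment unmapped feature value or the next segment unmapped feature value.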
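Optionally, screening out the hit sequencing fragments and strongly supported hit sequencing fragments of the targeted sequencing result from the alignment result based on the hit fragment screening conditions includes:
screening out, from the sequencing fragments, hit sequencing fragments that simultaneously meet the following hit fragment screening conditions:
the alignment length of each position in the sequencing fragment is greater than a length threshold, and the alignment length is greater than one-third of the total length of the sequencing fragment;
the sequence of the sequencing fragment whose length is the alignment length has no similar sequence in the alignment result;
the alignment quality value of the sequencing fragment is greater than or equal to the alignment quality threshold;
a hit sequencing fragment that meets the above hit fragment conditions is determined to be a strongly supported hit sequencing fragment when the alignment positions of its left-end and right-end sequencing fragments overlap.
Optionally, calculating the breakpoint position of the sequencing fragments according to the numbers of hit sequencing fragments and strongly supported hit sequencing fragments includes:
taking the maximum value of the weighted sum of the number of hit sequencing fragments and the number of strongly supported hit sequencing fragments in the sequencing fragments as the breakpoint position.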
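Optionally, before obtaining the targeted gene sequencing sequence to be identified and the reference gene sequence, the method further includes:
counting the number of bases, base quality and base length in the obtained targeted gene sequencing sequence;
identifying sequences to be filtered in the targeted gene sequencing sequence based on the number of bases, base quality and base length, and filtering the sequences to be filtered.
Optionally, identifying the adapter sequences in the targeted gene sequencing sequence based on the number of bases, base quality and base length includes:
taking sequencing sequences whose base quality is at the quality threshold, whose minimum base length is at the base length threshold, and whose average quality value is lower than the quality threshold as sequences to be filtered;
adding sequencing sequences whose left-end or right-end sequencing sequence overlaps the adapter sequence to a preset degree of overlap to the sequences to be filtered.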
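Some embodiments of the present disclosure provide a fusion gene identification device, the device including:
an acquisition module configured to obtain a targeted gene sequencing sequence to be identified and a reference gene sequence;
a comparison module configured to compare the targeted gene sequencing sequence with the reference gene sequence and obtain the distribution and targeted capture results of the spanning sequencing fragments and hit sequencing fragments located in the targeted region of the targeted sequencing sequence;
a screening module configured to screen out target fusion gene pairs from the sequencing fragments based on the distribution and targeted capture results;
an output module configured to output identification results regarding the target fusion gene pairs.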
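Optionally, the screening module is further configured to:
calculate the breakpoint position of the sequencing fragments according to the numbers of hit sequencing fragments and strongly supported hit sequencing fragments;
screen out target fusion gene pairs from the sequencing fragments according to the breakpoint position and the positions of the spanning sequencing fragments.
Optionally, the screening module is further configured to:
filter out sequencing fragments for which no supporting spanning sequencing fragment exists upstream or downstream of the breakpoint position they contain;
when the first end and the second end of a sequencing fragment retained after filtering are located in different genes, take the retained sequencing fragment as a candidate fusion gene pair;
filter out low-quality fusion gene pairs from the candidate fusion gene pairs to obtain the target fusion gene pairs.
Optionally, the screening module is further configured to:
filter out paralogous genes from the candidate fusion gene pairs to obtain first candidate fusion gene pairs;
calculate the number of gene pairings contained in the first candidate fusion gene pairs;
filter out first candidate fusion gene pairs whose number of gene pairings is greater than or equal to a gene pairing number threshold to obtain second candidate fusion gene pairs;
calculate the fusion gene score of the second candidate fusion gene pair according to the distance between the breakpoint positions in the second candidate fusion gene pair and the average sequencing depth of the targeted region;
filter out second candidate fusion gene pairs whose fusion gene score is less than a fusion gene score threshold to obtain the target candidate fusion gene pairs.
Optionally, the screening module is further configured to:
take the difference between the sum of the distances from the spanning sequencing fragment in the second fusion gene pair to the two breakpoints and the peak insert fragment length of the genome methylation sequencing sequence as the first factor score;
take the ratio of the distance from the two ends of the spanning sequencing fragment in the second fusion gene pair to the breakpoint position to the length of the sequencing fragment as the second factor score;
take the ratio of the distance from the two ends of the hit sequencing fragment in the second fusion gene pair to the breakpoint position to the doubling length of the sequencing fragment as the third factor score, where the doubling length is the product of the length of the sequencing fragment and a doubling parameter;
take the ratio of the sum of the first factor score, the second factor score and the third factor score to the average sequencing depth of the targeted region as the fusion gene score of the second candidate fusion gene pair.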
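Optionally, the comparison module is further configured to:
compare the targeted gene sequencing sequence with the reference gene sequence to obtain an alignment result;
screen out the spanning sequencing fragments in the targeted sequencing result from the alignment result based on spanning fragment screening conditions, and screen out the hit sequencing fragments and strongly supported hit sequencing fragments of the targeted sequencing result from the alignment result based on hit fragment screening conditions.
Optionally, the comparison module is further configured to:
screen out, from the sequencing fragments, spanning sequencing fragments that simultaneously meet the following spanning fragment screening conditions:
the sum of the left-end sequencing fragment length, the right-end sequencing fragment length and the distance between the left-end and right-end sequencing fragments is greater than the product of the lower quartile of the sequencing fragment length and a target parameter, where the target parameter is a parameter that controls the number and stringency of output pairings;
neither the left-end sequencing fragment nor the right-end sequencing fragment of the sequencing fragment has similar sequences;
the multiple alignment values of the left-end and right-end sequencing fragments include the proper alignment feature value and the secondary alignment feature value, and do not include the segment unmapped feature value or the next segment unmapped feature value.
Optionally, the comparison module is further configured to:
screen out, from the sequencing fragments, hit sequencing fragments that simultaneously meet the following hit fragment screening conditions:
the alignment length of each position in the sequencing fragment is greater than a length threshold, and the alignment length is greater than one-third of the total length of the sequencing fragment;
the sequence of the sequencing fragment whose length is the alignment length has no similar sequence in the alignment result;
the alignment quality value of the sequencing fragment is greater than or equal to the alignment quality threshold;
a hit sequencing fragment that meets the above hit fragment conditions is determined to be a strongly supported hit sequencing fragment when the alignment positions of its left-end and right-end sequencing fragments overlap.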
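Optionally, the screening module is further configured to:
take the maximum value of the weighted sum of the number of hit sequencing fragments and the number of strongly supported hit sequencing fragments in the sequencing fragments as the breakpoint position.
Optionally, the acquisition module is further configured to:
count the number of bases, base quality and base length in the obtained targeted gene sequencing sequence;
identify sequences to be filtered in the targeted gene sequencing sequence based on the number of bases, base quality and base length, and filter the sequences to be filtered.
Optionally, the acquisition module is further configured to:
take sequencing sequences whose base quality is at the quality threshold, whose minimum base length is at the base length threshold, and whose average quality value is lower than the quality threshold as sequences to be filtered;
add sequencing sequences whose left-end or right-end sequencing sequence overlaps the adapter sequence to a preset degree of overlap to the sequences to be filtered.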
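Some embodiments of the present disclosure provide a computing processing device, including:
a memory in which computer-readable code is stored;
one or more processors, wherein when the computer-readable code is executed by the one or more processors, the computing processing device performs the fusion gene identification method described above.
Some embodiments of the present disclosure provide a computer program including computer-readable code that, when run on a computing processing device, causes the computing processing device to perform the fusion gene identification method described above.
Some embodiments of the present disclosure provide a non-transitory computer-readable medium in which the fusion gene identification method described above is stored.
The above description is only an overview of the technical solution of the present disclosure. So that the technical means of the present disclosure can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the present disclosure are more apparent, specific embodiments of the present disclosure are set out below.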
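Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present disclosure, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 schematically shows a flow chart of a fusion gene identification method provided by some embodiments of the present disclosure;
Figure 2 schematically shows the first flow chart of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 3 schematically shows the second flow chart of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 4 schematically shows the third flow chart of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 5 schematically shows the first principle diagram of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 6 schematically shows the fourth flow chart of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 7 schematically shows the second principle diagram of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 8 schematically shows the third principle diagram of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 9 schematically shows the fourth principle diagram of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 10 schematically shows the fifth principle diagram of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 11 schematically shows the sixth principle diagram of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 12 schematically shows the first effect diagram of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 13 schematically shows the second effect diagram of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 14 schematically shows the third effect diagram of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 15 schematically shows the fourth effect diagram of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 16 schematically shows the fifth effect diagram of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 17 schematically shows the sixth effect diagram of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 18 schematically shows the seventh effect diagram of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 19 schematically shows the eighth effect diagram of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 20 schematically shows the ninth effect diagram of another fusion gene identification method provided by some embodiments of the present disclosure;
Figure 21 schematically shows a structural diagram of a fusion gene identification device provided by some embodiments of the present disclosure;
Figure 22 schematically shows a block diagram of a computing processing device for performing methods according to some embodiments of the present disclosure;
Figure 23 schematically shows a storage unit for holding or carrying program code that implements methods according to some embodiments of the present disclosure.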
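Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the scope of protection of the present disclosure.
With the future development of precision medicine, the use of molecular diagnostic methods to identify gene fusions will become an inevitable trend. In the related art, gene fusions are identified mainly through two sequencing approaches, WGS (Whole Genome Sequencing) and RNA-seq (transcriptome sequencing). These two approaches, as well as conventional approaches such as fluorescence in situ hybridization, each have advantages and disadvantages; see Table 1. Approaches such as fluorescence in situ hybridization have low throughput, and because tumor sample material is usually limited, detecting multiple fusions with a low-throughput approach is difficult. WGS and RNA-seq are high-throughput, but sequencing is expensive, and the large data volume creates difficulties for downstream analysis in server storage and computing resources and in long analysis times. As targeted sequencing gradually becomes the mainstream detection approach in tumor diagnosis, early cancer screening, reproductive genetics, immunotherapy and other areas, tools, analysis methods and analysis pipelines for identifying gene fusions from targeted sequencing need to be established.
Existing methods for identifying gene fusions with targeted sequencing all have certain limitations. Most are based on known fusion gene results and identify one group of gene fusions or the fusion genes of one type of cancer. These methods design probes only for fusion genes, and most apply only to known fusion genes and cannot identify unknown ones. Conventional targeted sequencing, by contrast, designs probes for target regions and is used to identify somatic/germline mutations, gene fusions, copy number variations, large chromosomal segment variations, tumor mutation burden, tumor microsatellite instability and so on in the target intervals. The above detection methods are therefore limited in practice and are not suitable for identifying fusion genes that may exist in the various target intervals of conventional targeted sequencing.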
Figure PCTCN2022083275-appb-000001
Table 1
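Figure 1 schematically shows a flow chart of a fusion gene identification method provided by the present disclosure. The method includes:
Step 101, obtain a targeted gene sequencing sequence to be identified and a reference gene sequence.
In the embodiments of the present disclosure, the targeted sequencing sequence to be identified is a gene sequencing sequence obtained by upstream experiments through targeted sequencing of the sample genome. The targeted sequencing approach may follow those commonly used in the field; it is not the focus of the present disclosure and is not described further here. The reference gene sequence is a genome sequencing sequence obtained by sequencing a high-quality human genome.
In some embodiments of the present disclosure, after the targeted sequencing data is obtained, low-quality data in the targeted sequencing sequence can be filtered out by preprocessing, for example adapter sequence removal or low-quality sequence filtering. The preprocessing can be set according to actual needs and is not limited here.
Step 102, compare the targeted gene sequencing sequence with the reference gene sequence, and obtain the distribution and targeted capture results of the spanning sequencing fragments and hit sequencing fragments located in the targeted region of the targeted sequencing sequence.
In the embodiments of the present disclosure, a spanning sequencing fragment (Spanning Read) is a sequencing fragment that covers the fusion site and whose left-end and right-end sequencing fragments can be aligned to different genes, and a hit sequencing fragment (Split Read) is a sequencing fragment that lies exactly on the fusion site. The left-end and right-end sequencing fragments are the fragments at the two opposite ends of a sequencing fragment; how they are divided depends on the arrangement of the sequence in the sequencing fragment, and the left and right directions differ with the arrangement. The targeted gene sequencing sequence is aligned to the reference gene sequence. For a human DNA (deoxyribonucleic acid) sample, the Hg19 or GRCh38 version of the reference gene sequence can be used, and BWA-MEM can be used as the alignment tool. After alignment, the results can be sorted linearly in reference genome order and stored in BAM format. Then, according to the pairing relationship between the Read1 and Read2 sequences of paired-end sequencing in the alignment result, the distribution of the successfully paired sequencing fragments in the targeted region and the targeted capture results are calculated. The targeted capture results refer to the positions of the sequencing fragments in the targeted region and are used in subsequent fusion gene identification.
Step 103, screen out target fusion gene pairs from the sequencing fragments based on the distribution and targeted capture results.
In the embodiments of the present disclosure, the distribution and positions of spanning and hit sequencing fragments in fusion gene pairs differ markedly from those in non-fusion genes. The present disclosure therefore formulates screening rules from the distribution characteristics, such as the numbers and base proportions, of the spanning and hit sequencing fragments of each sequencing fragment in the targeted region, together with the positional characteristics of the spanning and hit sequencing fragments, and screens the sequencing fragments in the targeted region. This selects, from the successfully paired gene pairs, the target fusion gene pairs that match the distribution characteristics of spanning and hit sequencing fragments in fusion gene pairs. In this way, different screening conditions can be formulated for different identification needs; no dedicated targeted sequencing approach needs to be developed for a specific fusion gene type, the identified fusion genes are no longer limited to the types targeted by the sequencing approach, and the utilization of targeted sequencing data is improved.
Step 104, output identification results regarding the target fusion gene pairs.
In the embodiments of the present disclosure, after the target fusion gene pairs are identified, they can be processed by a visualization module so that the user can view the identification results intuitively. The visualization module may be, for example, IGV, Read Map or another program for visual output of gene sequencing data. Further, data collected during identification, such as the distribution of spanning and hit sequencing fragments and the targeted capture results, can be merged into the identification results and displayed together so that the user can check and correct the results.
The embodiments of the present disclosure perform fusion gene identification on conventional targeted sequencing: the reference gene sequence is aligned to the targeted region of targeted sequencing to obtain the distribution and targeted capture results of hit sequencing fragments and spanning sequencing fragments near the targeted region, and the distribution characteristics of hit and spanning sequencing fragments in fusion gene pairs are used to screen out the fusion gene pairs in the targeted gene sequencing sequence. Screening can be performed on the distribution of sequencing fragments in different fusion genes, so the fusion gene identification results of targeted gene sequencing sequences are no longer limited to specific fusion gene types, which improves the utilization of targeted gene sequencing data in gene fusion identification.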
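Optionally, referring to Figure 2, step 103 includes:
Step 1031, calculate the breakpoint position of the sequencing fragments according to the numbers of hit sequencing fragments and strongly supported hit sequencing fragments.
In the embodiments of the present disclosure, a strongly supported hit sequencing fragment is a hit sequencing fragment whose left-end and right-end sequencing fragments align to overlapping positions on the genome. The breakpoint position of a sequencing fragment is the position of the gene breakpoint caused in a fusion gene pair by gene translocation, substitution and the like. The breakpoint position can be located according to the numbers of identified hit sequencing fragments and strongly supported hit sequencing fragments.
Step 1032, screen out target fusion gene pairs from the sequencing fragments according to the breakpoint position and the positions of the spanning sequencing fragments.
In the embodiments of the present disclosure, because a sequencing fragment with a breakpoint is not necessarily a fusion gene pair, the positional distribution of the breakpoint and the spanning sequencing fragments is used to further screen fusion gene pairs from the sequencing sequences that contain breakpoints.
Optionally, referring to Figure 3, step 1032 includes:
Step 10321, filter out sequencing fragments for which no supporting spanning sequencing fragment exists upstream or downstream of the breakpoint position they contain.
In the embodiments of the present disclosure, if there is no spanning sequencing fragment within a range before and after the breakpoint position whose length is the lower quartile of the sequencing fragment length, the sequencing fragment is discarded. If one exists, the breakpoint position is considered reliable and the sequencing fragment is retained for subsequent analysis.
Step 10322, when the first end and the second end of a sequencing fragment retained after filtering are located in different genes, take the retained sequencing fragment as a candidate fusion gene pair.
In the embodiments of the present disclosure, the first end and the second end are the two opposite ends of the sequencing fragment. The first and second ends, that is, the left and right ends, of the sequencing fragment at the identified breakpoint position are annotated: the gene in which each end lies is identified from a GFF3 file of the corresponding genome version, and the sequencing fragment is taken as a candidate fusion gene pair if and only if the two ends lie in different genes.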
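The annotation check in step 10322 can be sketched as an interval lookup followed by a comparison of the two gene names. The sketch assumes the gene records have already been parsed from a GFF3 file into `(chromosome, start, end, name)` tuples; the function names, the tuple layout and the toy gene table are hypothetical.

```python
def gene_at(chrom, pos, genes):
    """Return the name of the gene covering (chrom, pos), or None."""
    for g_chrom, start, end, name in genes:
        if g_chrom == chrom and start <= pos <= end:
            return name
    return None


def candidate_fusion(left_end, right_end, genes):
    """Return (left_gene, right_gene) if the two ends lie in different
    annotated genes, otherwise None.

    left_end, right_end -- (chromosome, position) of each fragment end
    """
    left_gene = gene_at(left_end[0], left_end[1], genes)
    right_gene = gene_at(right_end[0], right_end[1], genes)
    if left_gene is None or right_gene is None or left_gene == right_gene:
        return None
    return (left_gene, right_gene)


# Toy annotation, not real coordinates.
GENES = [("chr1", 1000, 5000, "GENE_A"), ("chr1", 9000, 12000, "GENE_B")]
```

A linear scan is enough for illustration; a real annotation with tens of thousands of genes would usually use an interval index instead.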
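Step 10323, filter out low-quality fusion gene pairs from the candidate fusion gene pairs to obtain the target fusion gene pairs.
In the embodiments of the present disclosure, the identified candidate fusion gene pairs are further screened against a quality evaluation standard to obtain the target fusion gene pairs, which ensures the quality of the output. The quality evaluation standard can be formulated from parameters such as a quality score, a confidence score or data precision; it can be set according to actual needs and is not limited here.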
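Optionally, referring to Figure 4, step 10323 includes:
Step 103231, filter out paralogous genes from the candidate fusion gene pairs to obtain first candidate fusion gene pairs.
In the embodiments of the present disclosure, paralogous genes are genes within the same species that arise from gene duplication and may evolve new functions related to the original one. For each identified candidate fusion, it is determined whether the fused genes are paralogs. If they are, the candidate fusion gene pair is filtered out and is not considered a fusion gene pair; if they are not, it is retained as a first candidate fusion gene pair for further filtering.
Step 103232, calculate the number of gene pairings contained in the first candidate fusion gene pairs.
Step 103233, filter out first candidate fusion gene pairs whose number of gene pairings is greater than or equal to the gene pairing number threshold to obtain second candidate fusion gene pairs.
In the embodiments of the present disclosure, the number of pairings is calculated for each identified first candidate fusion gene. Referring to Figure 5, if geneA is paired with geneB, geneC and geneD at the same time, the combination is filtered out and is not considered a fusion gene; otherwise it is taken as a second candidate fusion gene pair. This is only an example in which the gene pairing number threshold is 3; the threshold can be any other positive integer greater than 1, can be set according to actual needs, and is not limited here.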
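The pairing-count filter can be sketched by counting the distinct partners of every gene and dropping pairs in which either gene reaches the threshold. The function name and the list-of-tuples input are hypothetical; the threshold of 3 and the geneA example follow the text.

```python
from collections import defaultdict


def filter_by_pairing_count(pairs, threshold=3):
    """Drop candidate pairs whose genes have too many distinct partners.

    pairs     -- list of (gene_1, gene_2) candidate fusion pairs
    threshold -- a gene with this many partners or more is rejected
    """
    partners = defaultdict(set)
    for a, b in pairs:
        partners[a].add(b)
        partners[b].add(a)
    return [
        (a, b)
        for a, b in pairs
        if len(partners[a]) < threshold and len(partners[b]) < threshold
    ]


candidates = [
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneA", "geneD"),
    ("geneE", "geneF"),
]
kept = filter_by_pairing_count(candidates)
```

Here every pair involving geneA is removed because geneA has three partners, and only the geneE and geneF pair remains as a second candidate.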
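Step 103234, calculate the fusion gene score of the second candidate fusion gene pair according to the distance between the breakpoint positions in the second candidate fusion gene pair and the average sequencing depth of the targeted region.
Step 103235, filter out second candidate fusion gene pairs whose fusion gene score is less than the fusion gene score threshold to obtain the target candidate fusion gene pairs.
In the embodiments of the present disclosure, the confidence of a fusion gene pair can be measured by a fusion gene score computed with a scoring formula based on the distance between adjacent breakpoint positions and the average sequencing depth of the targeted region. For example, a higher fusion gene score can be taken to indicate a more credible fusion gene. The fusion gene score threshold is adjustable: the larger the threshold, the higher the confidence of the retained fusion gene pairs. It can be set according to actual needs and is not limited here.
Optionally, step 103234 may include:
N1. Take the difference between the sum of the distances from the spanning sequencing fragment in the second fusion gene pair to the two breakpoints and the peak insert fragment length of the genome methylation sequencing sequence as the first factor score;
N2. Take the ratio of the distance from the two ends of the spanning sequencing fragment in the second fusion gene pair to the breakpoint position to the length of the sequencing fragment as the second factor score;
N3. Take the ratio of the distance from the two ends of the hit sequencing fragment in the second fusion gene pair to the breakpoint position to the doubling length of the sequencing fragment as the third factor score, where the doubling length is the product of the length of the sequencing fragment and a doubling parameter;
N4. Take the ratio of the sum of the first factor score, the second factor score and the third factor score to the average sequencing depth of the targeted region as the fusion gene score of the second candidate fusion gene pair.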
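Steps N1 to N4 amount to three factor scores summed and normalised by depth. The sketch below takes each distance as a single pre-computed number, which is an assumption: the text does not say how distances are aggregated over multiple reads. The function and argument names and the toy input values are hypothetical.

```python
def fusion_gene_score(span_dist_sum, insert_peak, span_end_dist,
                      hit_end_dist, read_len, doubling_param, mean_depth):
    """Combine the three factor scores of steps N1-N4.

    span_dist_sum  -- sum of distances from the spanning read to the
                      two breakpoints
    insert_peak    -- peak of the insert fragment length distribution
    span_end_dist  -- distance from the spanning read ends to the
                      breakpoint
    hit_end_dist   -- distance from the hit read ends to the breakpoint
    read_len       -- sequencing fragment length
    doubling_param -- multiplier giving the "doubling length"
    mean_depth     -- average sequencing depth of the targeted region
    """
    factor_1 = span_dist_sum - insert_peak                 # N1
    factor_2 = span_end_dist / read_len                    # N2
    factor_3 = hit_end_dist / (read_len * doubling_param)  # N3
    return (factor_1 + factor_2 + factor_3) / mean_depth   # N4


# Toy values, not data from the disclosure.
score = fusion_gene_score(500, 350, 300, 150, 150, 2, 100)
```

Dividing by the average depth makes scores comparable across target regions sequenced at different depths; a pair is kept when its score reaches the chosen threshold.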
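Optionally, referring to Figure 6, step 102 includes:
Step 1021, compare the targeted gene sequencing sequence with the reference gene sequence to obtain an alignment result.
Step 1022, screen out the spanning sequencing fragments in the targeted sequencing result from the alignment result based on spanning fragment screening conditions, and screen out the hit sequencing fragments and strongly supported hit sequencing fragments of the targeted sequencing result from the alignment result based on hit fragment screening conditions.
In the embodiments of the present disclosure, parameters are measured from the alignment result, including the length of each sequencing fragment, the lengths of its left-end and right-end sequencing fragments, and the distance between the left-end and right-end sequencing fragments on the genome. The sequencing fragments are then screened against the spanning fragment screening conditions and the hit fragment screening conditions using these parameters, to determine the spanning sequencing fragments, the hit sequencing fragments, and the strongly supported hit sequencing fragments among the hit sequencing fragments in the targeted sequencing result.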
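Optionally, step 1022 includes the following S1 to S3:
S1, screen out, from the sequencing fragments, spanning sequencing fragments that simultaneously meet the following spanning fragment screening conditions A1 to A3:
A1. The sum of the left-end sequencing fragment length, the right-end sequencing fragment length and the distance between the left-end and right-end sequencing fragments is greater than the product of the lower quartile of the sequencing fragment length and a target parameter, where the target parameter is a parameter that controls the number and stringency of output pairings;
In the embodiments of the present disclosure, a spanning sequencing fragment needs to satisfy the following formula (1):
d + L1 + L2 > Insert d × C         (1)
Here, d represents the distance on the genome between the left-end sequencing fragment R1 and the right-end sequencing fragment R2 of the sequencing fragment, L1 represents the length of R1, L2 represents the length of R2, Insert d represents the lower quartile of the sequencing fragment length, and C represents a parameter that controls the number and stringency of output pairings. C can be adjusted according to specific needs and can be a positive integer from 10 to 100.
A2. Neither the left-end sequencing fragment nor the right-end sequencing fragment of the sequencing fragment has similar sequences;
In the embodiments of the present disclosure, similar sequences are sequences for which the homology comparison result of two sequencing sequences is greater than the homology comparison result threshold. The threshold can be set according to actual needs and is not limited here. Referring to Figure 7, neither the left-end nor the right-end sequencing fragment of a spanning sequencing fragment has multiple similar sequences on the genome; that is, no similar sequence has a homology comparison result above the threshold, which may be, for example, 5, 10, or 15.
A3. The multiple alignment values of the left-end and right-end sequencing fragments include the proper alignment feature value and the secondary alignment feature value, and do not include the segment unmapped feature value or the next segment unmapped feature value.
In the embodiments of the present disclosure, for either the left-end or the right-end sequencing fragment of a spanning sequencing fragment, the multiple alignment values contain only the proper alignment feature value (proper aligner) and the secondary alignment feature value (secondary alignment), and contain neither the segment unmapped feature value (segment unmapped) nor the next segment unmapped feature value (next segment unmapped). The next segment unmapped feature value indicates that the mate of the current read in the pair is not aligned.
The present disclosure screens sequencing fragments using the set spanning fragment screening conditions, which efficiently selects spanning sequencing fragments from the sequencing fragments and improves the efficiency of fusion gene identification.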
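S2, screen out, from the sequencing fragments, hit sequencing fragments that simultaneously meet the following hit fragment screening conditions B1 to B3:
B1. The alignment length of each position in the sequencing fragment is greater than a length threshold, and the alignment length is greater than one-third of the total length of the sequencing fragment;
In the embodiments of the present disclosure, a hit sequencing fragment needs to satisfy the following formula (2):
N > 20 and N > L/3              (2)
Here, N represents the alignment length of each position in the sequencing fragment, 20 is the length threshold, and L represents the total length of the sequencing fragment. Formula (2) is only illustrative; the specific N and L can be set according to actual needs and are not limited here.
B2. The sequence of the sequencing fragment whose length is the alignment length has no similar sequence in the alignment result;
In the embodiments of the present disclosure, referring to Figure 8, the aligned sequence of length N does not have multiple similar sequences on the genome; that is, no similar sequence has a homology comparison result above the threshold, which may be, for example, 5, 10, or 15.
B3. The alignment quality value of the sequencing fragment is greater than or equal to the alignment quality threshold;
In the embodiments of the present disclosure, for example, the alignment quality value Q is greater than or equal to 30, and the corresponding number of alignments is greater than 1.
The present disclosure screens sequencing fragments using the set hit fragment screening conditions, which efficiently selects hit sequencing fragments from the sequencing fragments and improves the efficiency of fusion gene identification.
S3, a hit sequencing fragment that meets the above hit fragment conditions is determined to be a strongly supported hit sequencing fragment when the alignment positions of its left-end and right-end sequencing fragments overlap.
In the embodiments of the present disclosure, refer to Figure 9, which shows a spanning read (top panel) and split reads (middle and bottom panels); in the middle panel R1 and R2 do not overlap, and in the bottom panel R1 and R2 overlap. When the alignment positions of the left-end and right-end sequencing fragments of a hit sequencing fragment selected by conditions B1 to B3 overlap on the genome, that is, the sequenced regions overlap, the pair is recorded as a strongly supported hit sequencing fragment among the hit sequencing fragments.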
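Optionally, step 1031 includes: taking the maximum value of the weighted sum of the number of hit sequencing fragments and the number of strongly supported hit sequencing fragments in the sequencing fragments as the breakpoint position.
In the embodiments of the present disclosure, because the breakpoint position is usually related to the numbers of hit sequencing fragments and strongly supported hit sequencing fragments, and a given sequencing fragment may have several of each, the breakpoint position can be determined by assigning different weights to the two counts and taking the maximum of their weighted sum.
Specifically, the breakpoint position of the sequencing fragment can be calculated through the following formula (3):
Ii = Max(n*bi + m*Bi)          (3)
Here, Ii represents the breakpoint position of the i-th sequencing fragment, bi represents the number of hit sequencing fragments in the i-th sequencing fragment, Bi represents the number of strongly supported hit sequencing fragments in the i-th sequencing fragment, n represents the weight value of the hit sequencing fragments, and m represents the weight value of the strongly supported hit sequencing fragments.
For example, when the weight value n of the hit sequencing fragments is 0.8 and the weight m of the strongly supported hit sequencing fragments is 2.5, the breakpoint position is Max(0.8bi+2.5Bi); when n is 0.6 and m is 3, the breakpoint position is Max(0.6bi+3Bi); when n is 7 and m is 10, the breakpoint position is Max(7bi+10Bi). These are only examples; the weight values of the hit and strongly supported hit sequencing fragments can be set according to actual needs and are not limited here.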
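Optionally, referring to Figure 10, before step 101, the method further includes:
Step 201, count the number of bases, base quality and base length in the obtained targeted gene sequencing sequence.
In some embodiments of the present disclosure, to identify the adapter sequence, a certain number of lines, for example the first 10,000 or 15,000 lines, of the left-end sequencing sequences in the targeted gene sequencing sequence can be retrieved and searched against the known sequences of each sequencing platform. This identifies the proportion of each type of adapter sequence in the sequencing sequences, and thereby determines the adapter sequence used for sequencing and its proportion.
Step 202, identify sequences to be filtered in the targeted gene sequencing sequence based on the number of bases, base quality and base length, and filter the sequences to be filtered.
In the embodiments of the present disclosure, the adapter sequence can be further identified based on the number of bases, base quality and base length in the sequences, so that adapter sequences that affect the subsequent identification process are filtered out. This ensures the quality of the input data for subsequent fusion gene identification and improves its accuracy.
Optionally, referring to Figure 11, step 202 includes:
Step 2021, take sequencing sequences whose base quality is at the quality threshold, whose minimum base length is at the base length threshold, and whose average quality value is lower than the quality threshold as sequences to be filtered.
Step 2022, add sequencing sequences whose left-end or right-end sequencing sequence overlaps the adapter sequence to a preset degree of overlap to the sequences to be filtered.
In the embodiments of the present disclosure, the data filtering standard may be to treat as adapter sequences those sequencing sequences whose base quality equals the base quality threshold, whose minimum base length equals the base length threshold, and whose maximum sequencing error rate equals the error rate threshold. Sequencing sequences whose left-end and right-end sequencing sequences overlap over a region at least as long as the preset degree of overlap, for example 3 bp, may also be treated as adapter sequences and trimmed, to ensure the quality of the input data for subsequent fusion gene identification and improve its accuracy.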
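By way of example, the embodiments of the present disclosure provide two examples of applying the above fusion gene identification method to specific scenarios for reference.
Example 1: Targeted exome sequencing identifies the gene fusion mutation TMPRSS2-ERG:
The raw data is preprocessed to obtain the raw sequencing data volume. The statistics are as follows:
Figure PCTCN2022083275-appb-000002
The adapter was detected to be the Illumina sequencing platform adapter ‘AGATCGGAAGAGC’, and 2.7% of the reads contained this adapter. After filtering according to the conditions preset in S1 of the Summary, the data statistics are as follows:
Figure PCTCN2022083275-appb-000003
After data processing, high-quality data is obtained, with quality values all above 30; see Figure 12.
The filtered high-quality data is aligned to the reference genome GRCh38 and stored in BAM format, and the insert fragment range of this set of sequencing data is calculated.
Figure 13 shows sample 1, Figure 14 shows sample 2, and Figure 15 shows sample 3. The insert fragment sizes of the three samples are 359, 356, and 374 respectively.
Based on the alignment result, spanning reads (spanning sequencing fragments) are screened.
First, the distance d and the lengths L1 and L2 of each paired read are calculated. The C parameter in formula (1) is preset to 10.
The identified spanning read pair information includes at least: paired read ID, the read's alignment flag value, reference sequence name, aligned chromosome position, alignment quality value, alignment match status (CIGAR string), name of the aligned reference (chromosome), position of the first matched base, library insert size, sequence fragment, and sequence fragment quality value.
Based on the position information of the spanning reads and split reads (hit sequencing fragments), breakpoints were identified near 38,528,404 and near 38528747 on chromosome 21. GFF3 annotation of this position shows it to be the ERG gene. Breakpoints were also found near 42,508,100 and near 42,508,215 bp on chromosome 21; gene annotation of this position shows it to be the TMPRSS2 gene. Homologous gene annotation showed that the two genes are not paralogs. The fusion score calculated with the fusion gene score formula is 1742, which meets the score requirement for fusion gene identification.
The breakpoint visualization can be seen in Figure 16, where a breakpoint was found on the ERG gene, located at 21q22.2 between 38,528,440 bp and 38,528,750 bp; the top, middle and bottom panels are sample 1, sample 2 and sample 3 in order. Refer also to Figure 17, where a breakpoint was found on the TMPRSS2 gene, located at 21q22.3 between 41,507,900 and 41,508,300; the top, middle and bottom panels are sample 1, sample 2 and sample 3 in order.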
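Example 2: Targeted exome sequencing of bladder cancer to identify gene fusions:
The raw data is preprocessed to obtain the raw sequencing data volume. The statistics are as follows:
Figure PCTCN2022083275-appb-000004
The adapter was detected to be the Illumina sequencing platform adapter ‘AGATCGGAAGAGC’, and 13.25% of the reads contained this adapter. After filtering according to the conditions preset in S1 of the Summary, the data statistics are as follows:
Figure PCTCN2022083275-appb-000005
After data processing, high-quality data is obtained. The base quality values of the two samples are essentially all above 30, and the mean values of some bases between 95~95 bp are close to the critical value of 30; see Figure 18. The filtered high-quality data is aligned to the reference genome GRCh38 and stored in BAM format, and the insert fragment range of this set of sequencing data is calculated. The insert fragment sizes of the two samples are 142 and 147 respectively; Figure 19 shows sample 1 and Figure 20 shows sample 2.
Because no breakpoint supported by both spanning reads and split reads was identified, no gene fusion is present in the sample data.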
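Figure 21 schematically shows a structural diagram of a fusion gene identification device 30 provided by the present disclosure. The device includes:
an acquisition module 301 configured to obtain a targeted gene sequencing sequence to be identified and a reference gene sequence;
a comparison module 302 configured to compare the targeted gene sequencing sequence with the reference gene sequence and obtain the distribution and targeted capture results of the spanning sequencing fragments and hit sequencing fragments located in the targeted region of the targeted sequencing sequence;
a screening module 303 configured to screen out target fusion gene pairs from the sequencing fragments based on the distribution and targeted capture results;
an output module 304 configured to output identification results regarding the target fusion gene pairs.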
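Optionally, the screening module 303 is further configured to:
calculate the breakpoint positions of the sequencing fragments according to the numbers of hit sequencing fragments and strongly supporting hit sequencing fragments;
screen target fusion gene pairs from the sequencing fragments according to the breakpoint positions and the positions of the spanning sequencing fragments.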
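Optionally, the screening module 303 is further configured to:
filter out sequencing fragments for which no supporting spanning sequencing fragment exists upstream or downstream of the breakpoint position they contain;
where the first end and the second end of a sequencing fragment retained after filtering are located in different genes, take the retained sequencing fragment as a candidate fusion gene pair;
filter out low-quality fusion gene pairs from the candidate fusion gene pairs to obtain the target fusion gene pairs.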
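Optionally, the screening module 303 is further configured to:
filter out paralogous genes from the candidate fusion gene pairs to obtain first candidate fusion gene pairs;
calculate the number of gene pairings contained in the first candidate fusion gene pairs;
filter out the first candidate fusion gene pairs whose number of gene pairings is greater than or equal to a gene pairing number threshold to obtain second candidate fusion gene pairs;
calculate a fusion gene score for each second candidate fusion gene pair according to the distance between the breakpoint positions in the second candidate fusion gene pair and the average sequencing depth of the target region;
filter out the second candidate fusion gene pairs whose fusion gene score is below a fusion gene score threshold to obtain the target fusion gene pairs.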
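Optionally, the screening module 303 is further configured to:
take as a first factor score the difference between the sum of the distances from the spanning sequencing fragment to the two breakpoints in the second candidate fusion gene pair and the peak insert size of the genomic methylation sequencing sequence;
take as a second factor score the ratio of the distance from the two ends of the spanning sequencing fragment to the breakpoint position in the second candidate fusion gene pair to the length of the sequencing fragment;
take as a third factor score the ratio of the distance from the two ends of the hit sequencing fragment to the breakpoint position in the second candidate fusion gene pair to the multiplied length of the sequencing fragment, where the multiplied length is the product of the length of the sequencing fragment and a multiplication parameter;
take as the fusion gene score of the second candidate fusion gene pair the ratio of the sum of the first factor score, the second factor score and the third factor score to the average sequencing depth of the target region.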
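The score described by the four items above can be written out as a short function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, each distance is taken as a single scalar, and how the factors are aggregated over many supporting fragments is not specified here and is left out.

```python
def fusion_score(span_dist_bp1, span_dist_bp2, insert_peak,
                 span_end_dist, hit_end_dist,
                 read_length, multiplier, mean_depth):
    """Combine the three factor scores into one fusion gene score.

    span_dist_bp1, span_dist_bp2: distances from the spanning fragment
        to the two breakpoints.
    insert_peak: peak of the insert size distribution.
    span_end_dist: distance from the spanning fragment ends to the breakpoint.
    hit_end_dist: distance from the hit fragment ends to the breakpoint.
    read_length: length of the sequencing fragment.
    multiplier: multiplication parameter applied to read_length.
    mean_depth: average sequencing depth of the target region.
    """
    if mean_depth <= 0 or read_length <= 0 or multiplier <= 0:
        raise ValueError("depth, read length and multiplier must be positive")
    # First factor: summed breakpoint distances minus the insert size peak.
    factor1 = (span_dist_bp1 + span_dist_bp2) - insert_peak
    # Second factor: spanning fragment distance relative to fragment length.
    factor2 = span_end_dist / read_length
    # Third factor: hit fragment distance relative to the multiplied length.
    factor3 = hit_end_dist / (read_length * multiplier)
    # Normalise by depth so that deeply covered regions are not favoured.
    return (factor1 + factor2 + factor3) / mean_depth
```

Dividing by the average depth keeps scores comparable between target regions with very different coverage, which is presumably why the depth appears in the denominator.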
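Optionally, the alignment module 302 is further configured to:
align the targeted gene sequencing sequence with the reference gene sequence to obtain an alignment result;
screen the spanning sequencing fragments of the targeted sequencing result from the alignment result based on spanning-fragment screening conditions, and screen the hit sequencing fragments and strongly supporting hit sequencing fragments of the targeted sequencing result from the alignment result based on hit-fragment screening conditions.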
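Optionally, the alignment module 302 is further configured to:
screen from the sequencing fragments the spanning sequencing fragments that satisfy all of the following spanning-fragment screening conditions:
the sum of the left-end sequencing fragment length, the right-end sequencing fragment length, and the distance between the left-end and right-end sequencing fragments is greater than the product of the lower quartile of the sequencing fragment lengths and a target parameter, the target parameter being a parameter that controls the number of output pairs and the stringency;
neither the left-end sequencing fragment nor the right-end sequencing fragment has a similar sequence;
the multiple-alignment values of the left-end and right-end sequencing fragments include the proper-alignment flag value and the secondary-alignment flag value, and do not include the flag value indicating that the sequencing fragment is unpaired or the flag value indicating that the other sequencing fragment is unpaired.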
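The three spanning-fragment conditions above can be sketched as a single predicate. The names `ReadPair` and `is_spanning` are hypothetical. As an assumption, the two excluded flag values are mapped to the SAM FLAG bits 0x4 (read unmapped) and 0x8 (mate unmapped); the flag values that must be present are passed in as a caller-supplied bit mask because their exact SAM encoding is not stated here. The default C = 10 is the value used in Example 1.

```python
from dataclasses import dataclass

READ_UNMAPPED = 0x4   # SAM FLAG bit: this read is unmapped
MATE_UNMAPPED = 0x8   # SAM FLAG bit: the mate is unmapped


@dataclass
class ReadPair:
    left_len: int            # L1, length of the left-end read
    right_len: int           # L2, length of the right-end read
    distance: int            # d, distance between the two reads
    left_flag: int           # SAM FLAG of the left-end read
    right_flag: int          # SAM FLAG of the right-end read
    left_has_similar: bool   # a similar sequence exists for the left read
    right_has_similar: bool  # a similar sequence exists for the right read


def is_spanning(pair, length_q1, c=10, required_bits=0):
    """Return True if the pair meets all spanning-fragment conditions.

    length_q1 is the lower quartile of the fragment lengths and c is the
    target parameter controlling stringency.
    """
    # Condition 1: L1 + L2 + d must exceed Q1 * C.
    if pair.left_len + pair.right_len + pair.distance <= length_q1 * c:
        return False
    # Condition 2: neither end may have a similar sequence.
    if pair.left_has_similar or pair.right_has_similar:
        return False
    # Condition 3: flag bits that must be absent or present.
    for flag in (pair.left_flag, pair.right_flag):
        if flag & (READ_UNMAPPED | MATE_UNMAPPED):
            return False
        if flag & required_bits != required_bits:
            return False
    return True
```

A larger `c` raises the distance a pair must span before it is kept, so fewer but more convincing pairs are reported.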
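Optionally, the alignment module 302 is further configured to:
screen from the sequencing fragments the hit sequencing fragments that satisfy all of the following hit-fragment screening conditions:
the alignment length at each position of the sequencing fragment is greater than a length threshold, and the alignment length is greater than one third of the total length of the sequencing fragment;
the sequence of the sequencing fragment whose length is the alignment length has no similar sequence in the alignment result;
the alignment-count quality value of the sequencing fragment is greater than or equal to the alignment count;
a hit sequencing fragment satisfying the above hit-fragment conditions is determined to be a strongly supporting hit sequencing fragment when the alignment positions of its left-end and right-end sequencing fragments overlap.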
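The hit-fragment conditions above can be sketched for one aligned segment as follows. All names and default thresholds are assumptions, the aligned length is derived from a SAM CIGAR string by counting `M`, `=` and `X` operations, and the quality condition is simplified to a minimum mapping quality.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_length(cigar):
    """Number of read bases aligned to the reference (M, = and X ops)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=X")


def is_hit_read(cigar, read_length, mapq, has_similar,
                min_aligned=20, min_mapq=1):
    """Apply the hit-fragment conditions to one aligned segment.

    min_aligned and min_mapq are illustrative thresholds.
    """
    n = aligned_length(cigar)
    return (
        n > min_aligned            # longer than the length threshold
        and n > read_length / 3    # more than one third of the read
        and not has_similar        # no similar sequence elsewhere
        and mapq >= min_mapq       # simplified quality condition
    )


def is_strong_support(left_interval, right_interval):
    """True if the half-open (start, end) alignment intervals overlap."""
    left_start, left_end = left_interval
    right_start, right_end = right_interval
    return left_start < right_end and right_start < left_end
```

For a split read with several aligned segments, `is_hit_read` would be applied to each segment in turn, since the length condition is stated per position.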
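Optionally, the screening module 303 is further configured to:
take as the breakpoint position the position at which the weighted sum of the number of hit sequencing fragments and the number of strongly supporting hit sequencing fragments among the sequencing fragments is maximal.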
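The weighted-sum rule for the breakpoint position can be sketched as below. The function name and the weights (1 for a hit fragment, 2 for a strongly supporting hit fragment) are assumptions made for illustration; the weights are not given in this passage.

```python
from collections import Counter


def call_breakpoint(hit_positions, strong_positions,
                    w_hit=1.0, w_strong=2.0):
    """Return the position with the largest weighted support, or None.

    hit_positions: candidate breakpoint position reported by each hit
        sequencing fragment.
    strong_positions: the same for strongly supporting hit fragments.
    """
    scores = Counter()
    for pos in hit_positions:
        scores[pos] += w_hit
    for pos in strong_positions:
        scores[pos] += w_strong
    if not scores:
        return None
    # Ties are resolved in favour of the position seen first.
    return max(scores, key=scores.get)
```

Giving strongly supporting fragments a larger weight lets a position backed by overlapping read pairs win over one backed only by a slightly larger number of ordinary hit fragments.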
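Optionally, the acquisition module 301 is further configured to:
count the base count, base quality and base length in the acquired targeted gene sequencing sequence;
identify the sequences to be filtered in the targeted gene sequencing sequence according to the base count, base quality and base length, and filter out the sequences to be filtered.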
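Optionally, the acquisition module 301 is further configured to:
with the base quality set to the quality threshold and the minimum base length set to the base length threshold, take the sequencing sequences whose average quality value is below the quality threshold as the sequences to be filtered;
add to the sequences to be filtered any sequencing sequence whose left-end or right-end sequencing sequence overlaps the adapter sequence to at least a preset degree of overlap.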
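The embodiments of the present disclosure identify fusion genes from conventional targeted sequencing. The reference gene sequence is aligned against the target region of the targeted sequencing to obtain the distribution of hit sequencing fragments and spanning sequencing fragments near the target region, together with the targeted capture result. Using the distribution characteristics of hit sequencing fragments and spanning sequencing fragments in fusion gene pairs, the fusion gene pairs in the targeted gene sequencing sequence are screened out. Because screening is based on the distribution of sequencing fragments for different fusion genes, the fusion gene identification result for a targeted gene sequencing sequence is no longer limited to specific types of fusion gene, which improves the utilization of targeted gene sequencing data in gene fusion identification.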
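The component embodiments of the present disclosure may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that, in practice, a microprocessor or a digital signal processor (DSP) may be used to implement some or all of the functions of some or all of the components of the computing processing device according to the embodiments of the present disclosure. The present disclosure may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present disclosure may be stored on a non-transitory computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.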
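For example, FIG. 22 shows a computing processing device that can implement the method according to the present disclosure. The computing processing device conventionally includes a processor 410 and a computer program product or non-transitory computer-readable medium in the form of a memory 420. The memory 420 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk or ROM. The memory 420 has a storage space 430 for program code 431 for performing any of the method steps of the above method. For example, the storage space 430 for program code may include individual program codes 431, each implementing one of the steps of the above method. The program code may be read from, or written to, one or more computer program products. These computer program products include program code carriers such as hard disks, compact discs (CDs), memory cards or floppy disks. Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG. 23. The storage unit may have storage segments, storage space and the like arranged similarly to the memory 420 in the computing processing device of FIG. 22. The program code may, for example, be compressed in a suitable form. Typically, the storage unit includes computer-readable code 431', that is, code that can be read by a processor such as 410, and which, when run by the computing processing device, causes the computing processing device to perform the steps of the method described above.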
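It should be understood that although the steps in the flowcharts of the drawings are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, there is no strict restriction on the order in which the steps are performed, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be performed at different times; nor are they necessarily performed one after another, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
References herein to "one embodiment", "an embodiment" or "one or more embodiments" mean that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Note also that instances of the phrase "in one embodiment" do not necessarily all refer to the same embodiment.
Numerous specific details are set forth in the description provided here. It will be understood, however, that the embodiments of the present disclosure may be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present disclosure may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure.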

Claims (15)

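  1. A fusion gene identification method, characterized in that the method includes:
    acquiring a targeted gene sequencing sequence to be identified and a reference gene sequence;
    aligning the targeted gene sequencing sequence with the reference gene sequence, and obtaining the distribution of spanning sequencing fragments and hit sequencing fragments located in the target region of the targeted sequencing sequence, together with the targeted capture result;
    screening target fusion gene pairs from the sequencing fragments based on the distribution and the targeted capture result;
    outputting the identification result for the target fusion gene pairs.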
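  2. The method according to claim 1, characterized in that the screening of target fusion gene pairs from the sequencing fragments based on the distribution and the targeted capture result includes:
    calculating the breakpoint positions of the sequencing fragments according to the numbers of hit sequencing fragments and strongly supporting hit sequencing fragments;
    screening target fusion gene pairs from the sequencing fragments according to the breakpoint positions and the positions of the spanning sequencing fragments.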
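  3. The method according to claim 2, characterized in that the screening of target fusion gene pairs from the sequencing fragments according to the breakpoint positions and the positions of the spanning sequencing fragments includes:
    filtering out sequencing fragments for which no supporting spanning sequencing fragment exists upstream or downstream of the breakpoint position they contain;
    where the first end and the second end of a sequencing fragment retained after filtering are located in different genes, taking the retained sequencing fragment as a candidate fusion gene pair;
    filtering out low-quality fusion gene pairs from the candidate fusion gene pairs to obtain the target fusion gene pairs.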
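  4. The method according to claim 3, characterized in that the filtering out of low-quality fusion gene pairs from the candidate fusion gene pairs to obtain the target fusion gene pairs includes:
    filtering out paralogous genes from the candidate fusion gene pairs to obtain first candidate fusion gene pairs;
    calculating the number of gene pairings contained in the first candidate fusion gene pairs;
    filtering out the first candidate fusion gene pairs whose number of gene pairings is greater than or equal to a gene pairing number threshold to obtain second candidate fusion gene pairs;
    calculating a fusion gene score for each second candidate fusion gene pair according to the distance between the breakpoint positions in the second candidate fusion gene pair and the average sequencing depth of the target region;
    filtering out the second candidate fusion gene pairs whose fusion gene score is below a fusion gene score threshold to obtain the target fusion gene pairs.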
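  5. The method according to claim 4, characterized in that the calculating of a fusion gene score for the second candidate fusion gene pair according to the distance between the breakpoint positions in the second candidate fusion gene pair and the average sequencing depth of the target region includes:
    taking as a first factor score the difference between the sum of the distances from the spanning sequencing fragment to the two breakpoints in the second candidate fusion gene pair and the peak insert size of the genomic methylation sequencing sequence;
    taking as a second factor score the ratio of the distance from the two ends of the spanning sequencing fragment to the breakpoint position in the second candidate fusion gene pair to the length of the sequencing fragment;
    taking as a third factor score the ratio of the distance from the two ends of the hit sequencing fragment to the breakpoint position in the second candidate fusion gene pair to the multiplied length of the sequencing fragment, where the multiplied length is the product of the length of the sequencing fragment and a multiplication parameter;
    taking as the fusion gene score of the second candidate fusion gene pair the ratio of the sum of the first factor score, the second factor score and the third factor score to the average sequencing depth of the target region.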
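  6. The method according to claim 2, characterized in that the aligning of the targeted gene sequencing sequence with the reference gene sequence and the obtaining of the distribution of spanning sequencing fragments and hit sequencing fragments located in the target region of the targeted sequencing sequence, together with the targeted capture result, includes:
    aligning the targeted gene sequencing sequence with the reference gene sequence to obtain an alignment result;
    screening the spanning sequencing fragments of the targeted sequencing result from the alignment result based on spanning-fragment screening conditions, and screening the hit sequencing fragments and strongly supporting hit sequencing fragments of the targeted sequencing result from the alignment result based on hit-fragment screening conditions.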
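  7. The method according to claim 6, characterized in that the screening of the spanning sequencing fragments of the targeted sequencing result from the alignment result based on spanning-fragment screening conditions includes:
    screening from the sequencing fragments the spanning sequencing fragments that satisfy all of the following spanning-fragment screening conditions:
    the sum of the left-end sequencing fragment length, the right-end sequencing fragment length, and the distance between the left-end and right-end sequencing fragments is greater than the product of the lower quartile of the sequencing fragment lengths and a target parameter, the target parameter being a parameter that controls the number of output pairs and the stringency;
    neither the left-end sequencing fragment nor the right-end sequencing fragment has a similar sequence;
    the multiple-alignment values of the left-end and right-end sequencing fragments include the proper-alignment flag value and the secondary-alignment flag value, and do not include the flag value indicating that the sequencing fragment is unpaired or the flag value indicating that the other sequencing fragment is unpaired.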
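  8. The method according to claim 6, characterized in that the screening of the hit sequencing fragments and strongly supporting hit sequencing fragments of the targeted sequencing result from the alignment result based on hit-fragment screening conditions includes:
    screening from the sequencing fragments the hit sequencing fragments that satisfy all of the following hit-fragment screening conditions:
    the alignment length at each position of the sequencing fragment is greater than a length threshold, and the alignment length is greater than one third of the total length of the sequencing fragment;
    the sequence of the sequencing fragment whose length is the alignment length has no similar sequence in the alignment result;
    the alignment-count quality value of the sequencing fragment is greater than or equal to the alignment count;
    a hit sequencing fragment satisfying the above hit-fragment conditions is determined to be a strongly supporting hit sequencing fragment when the alignment positions of its left-end and right-end sequencing fragments overlap.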
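  9. The method according to claim 2, characterized in that the calculating of the breakpoint positions of the sequencing fragments according to the numbers of hit sequencing fragments and strongly supporting hit sequencing fragments includes:
    taking as the breakpoint position the position at which the weighted sum of the number of hit sequencing fragments and the number of strongly supporting hit sequencing fragments among the sequencing fragments is maximal.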
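  10. The method according to claim 1, characterized in that, before the acquiring of the targeted gene sequencing sequence to be identified and the reference gene sequence, the method further includes:
    counting the base count, base quality and base length in the acquired targeted gene sequencing sequence;
    identifying the sequences to be filtered in the targeted gene sequencing sequence according to the base count, base quality and base length, and filtering out the sequences to be filtered.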
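  11. The method according to claim 10, characterized in that the identifying of the sequences to be filtered in the targeted gene sequencing sequence according to the base count, base quality and base length includes:
    with the base quality set to the quality threshold and the minimum base length set to the base length threshold, taking the sequencing sequences whose average quality value is below the quality threshold as the sequences to be filtered;
    adding to the sequences to be filtered any sequencing sequence whose left-end or right-end sequencing sequence overlaps the adapter sequence to at least a preset degree of overlap.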
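  12. A fusion gene identification apparatus, characterized in that the apparatus includes:
    an acquisition module, configured to acquire a targeted gene sequencing sequence to be identified and a reference gene sequence;
    an alignment module, configured to align the targeted gene sequencing sequence with the reference gene sequence, and to obtain the distribution of spanning sequencing fragments and hit sequencing fragments located in the target region of the targeted sequencing sequence, together with the targeted capture result;
    a screening module, configured to screen target fusion gene pairs from the sequencing fragments based on the distribution and the targeted capture result;
    an output module, configured to output the identification result for the target fusion gene pairs.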
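  13. A computing processing device, characterized by including:
    a memory in which computer-readable code is stored;
    one or more processors, wherein, when the computer-readable code is executed by the one or more processors, the computing processing device performs the fusion gene identification method according to any one of claims 1-11.
  14. A computer program, characterized by including computer-readable code which, when run on a computing processing device, causes the computing processing device to perform the fusion gene identification method according to any one of claims 1-11.
  15. A non-transitory computer-readable medium, characterized in that it stores a computer program for the fusion gene identification method according to any one of claims 1-11.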
PCT/CN2022/083275 2022-03-28 2022-03-28 Fusion gene identification method, device, equipment, program and storage medium WO2023184065A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
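PCT/CN2022/083275 WO2023184065A1 (zh) 2022-03-28 2022-03-28 Fusion gene identification method, device, equipment, program and storage medium
CN202280000556.3A CN117136411A (zh) 2022-03-28 2022-03-28 Fusion gene identification method, device, equipment, program and storage medium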

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/083275 WO2023184065A1 (zh) 2022-03-28 2022-03-28 Fusion gene identification method, device, equipment, program and storage medium

Publications (1)

Publication Number Publication Date
WO2023184065A1 true WO2023184065A1 (zh) 2023-10-05

Family

ID=88198550

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/083275 WO2023184065A1 (zh) 2022-03-28 2022-03-28 Fusion gene identification method, device, equipment, program and storage medium

Country Status (2)

Country Link
CN (1) CN117136411A (zh)
WO (1) WO2023184065A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180355423A1 (en) * 2017-06-12 2018-12-13 Grail, Inc. Alignment free filtering for identifying fusions
WO2019136364A1 (en) * 2018-01-05 2019-07-11 Illumina, Inc. Process for aligning targeted nucleic acid sequencing data
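CN111108218A (zh) * 2017-09-20 2020-05-05 生命科技股份有限公司 Method for detecting fusions using compressed molecular tagged nucleic acid sequence data
CN111180013A (zh) * 2019-12-23 2020-05-19 北京橡鑫生物科技有限公司 Device for detecting fusion genes in hematological diseases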
US20220073980A1 (en) * 2018-11-29 2022-03-10 Xgenomes Corp. Sequencing by coalescence

Also Published As

Publication number Publication date
CN117136411A (zh) 2023-11-28

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22933924

Country of ref document: EP

Kind code of ref document: A1