CN111321202A

CN111321202A - Gene fusion variation library construction method, detection method, device, equipment and storage medium

Info

Publication number: CN111321202A
Application number: CN201911419273.9A
Authority: CN
Inventors: 黄晓强; 刘菲菲; 区小华; 陈禹欣; 杨娟; 赵薇薇; 于世辉; 赵纤纤; 冯菁华
Original assignee: Guangzhou Kingmed Diagnostics Group Co ltd
Current assignee: Guangzhou Kingmed Diagnostics Group Co ltd
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2020-06-23

Abstract

The present invention relates to a method for constructing a gene fusion variant library, a gene fusion variant detection method, a device, computer equipment and a computer storage medium. The library construction method, detection method and device are based on multi-gene targeted RNA sequencing with DNA probe hybridization capture: fusion gene capture probes hybridize to and capture the target fusion genes to construct a gene fusion variant library. The library can be used for high-throughput sequencing, and bioinformatics analysis then identifies the core genes that form the breakpoints and their partner genes. The invention further provides a quantitative analysis method for fusion genes, by which the variation ratio of a fusion gene, and in turn an accurate expression value of the fusion gene, can be obtained. This quantitative analysis method is pioneering and addresses the problem of quantitative analysis when fusion genes are detected by NGS.

Description

Gene fusion variant library construction method, detection method, device, equipment and storage medium

Technical Field

The present invention relates to the technical fields of molecular biology and bioinformatics, and in particular to a method for constructing a gene fusion variant library, a detection method, a device, equipment and a storage medium.

Background

Cytogenetic studies have found recurrent chromosomal translocations in a range of hematological tumors, including AML, ALL, CML and NHLs. These translocations lead to abnormal expression of oncogenes and/or transcription of fusion genes, both of which promote the transformation and survival of cancer cells. The core driver genes involved (such as MLL and ALK) often have multiple fusion partners, and a given fusion gene may also have different breakpoints, giving rise to different subtypes. For example, the MLL gene has 54 known fusion partners, and the COSMIC database lists as many as 15 fusion subtypes of the KMT2A-AFF1 fusion gene (https://cancer.sanger.ac.uk/cosmic/fusion/overview?fid=359723&gid=271430). These fusion gene variants affect clinical prognosis and can guide the molecular typing and targeted therapy of hematological tumors. Developing a genomic detection reagent that identifies gene fusion variants in hematological tumors is therefore an unmet need.

RT-PCR and fluorescence in situ hybridization (FISH) are two commonly used techniques for detecting gene fusions. Both detect a single, specific type of known gene fusion; their scope is narrow, their efficiency is low, and they cannot detect novel gene fusion variants. The shortcomings of fusion gene detection technology therefore still limit the auxiliary diagnosis and precision medicine of hematological tumors.

Summary of the Invention

In view of this, it is necessary to provide a gene fusion variant library construction method, a detection method, a device, computer equipment and a computer storage medium that have a wide range of application and high detection efficiency and can detect novel gene fusion variants.

A method for constructing a gene fusion variant library comprises the following steps:

extracting total RNA from a sample and removing the rRNA from it;

reverse transcribing the rRNA-depleted total RNA and synthesizing double-stranded cDNA, with dUTP used in place of dTTP when the second strand of the double-stranded cDNA is synthesized;

performing end repair on the synthesized double-stranded cDNA and ligating adapters;

enzymatically digesting the dUTP in the end-repaired, adapter-ligated double-stranded DNA so that the double-stranded cDNA is nicked;

amplifying the digested double-stranded DNA to construct a cDNA pre-library;

hybridizing fusion gene capture probes to the cDNA pre-library to capture the target fusion cDNA, wherein the target fusion cDNA is formed by the fusion of at least two different genes, and the fusion gene capture probes contain a sequence capable of complementary pairing with the sequence of one of the genes of the target fusion cDNA;

amplifying the captured target fusion cDNA to obtain the gene fusion variant library.

In one embodiment, the fusion gene capture probes are designed according to the following principles:

(1) the fusion gene capture probes are designed against the core gene in the target fusion cDNA, where a core gene is a gene that has multiple partner genes and is prone to fusion variation, a key gene in a cell growth or proliferation signaling pathway, or a driver gene;

(2) the fusion gene capture probes are designed against the transcript sequences of the core gene;

(3) the fusion gene capture probes are designed against the core genes in the hg19 reference genome, with a coverage density of 2× tiling;

(4) the fusion gene capture probes are 120 bp long;

(5) during design, each fusion gene capture probe is aligned to the human transcriptome and all of its BLAST hits are counted; a probe with no more than 50 BLAST hits is acceptable, whereas a probe with more than 50 BLAST hits is redesigned by replacing mismatched bases until it has the best match to the target gene sequence and no more than 50 BLAST hits.
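The length, tiling and specificity rules above can be illustrated with a minimal Python sketch. It assumes that "2× tiling" means each base is covered by roughly two probes, so that 120 bp probes start every 60 bp; this reading, the function names and the handling of the 3' end are illustrative assumptions, not details taken from the disclosure. The BLAST hit count is supplied by the caller, since the alignment itself is outside the scope of the sketch.

```python
from typing import List, Tuple

PROBE_LENGTH = 120    # probe length stated in the text (bp)
TILING_DENSITY = 2    # assumption: "2x tiling" = each base covered by ~2 probes
MAX_BLAST_HITS = 50   # acceptance limit stated in the text


def tile_probes(transcript: str,
                probe_length: int = PROBE_LENGTH,
                density: int = TILING_DENSITY) -> List[Tuple[int, str]]:
    """Return (start, sequence) pairs of overlapping probes along a transcript."""
    if len(transcript) < probe_length:
        return []
    step = probe_length // density
    last_start = len(transcript) - probe_length
    starts = list(range(0, last_start + 1, step))
    # Add a final probe flush with the 3' end so the tail is not left uncovered.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s, transcript[s:s + probe_length]) for s in starts]


def probe_passes_specificity(blast_hit_count: int,
                             max_hits: int = MAX_BLAST_HITS) -> bool:
    """A probe is acceptable when its BLAST hit count is not greater than the limit."""
    return blast_hit_count <= max_hits
```

Under these assumptions a 300-nt transcript yields probes starting at positions 0, 60, 120 and 180. A probe that fails the specificity check would go back for redesign, which the sketch does not attempt.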

In one embodiment, the 5' end of the fusion gene capture probe is labeled with a linker for capture;

optionally, the linker is biotin or streptavidin.

In one embodiment, the total RNA of the sample is total RNA from a peripheral blood or bone marrow sample.

In one embodiment, the end repair consists of adding one dATP to the 3' end of the synthesized double-stranded cDNA;

the adapter ligation introduces adapters in the format P5-Real1primer-DNAINSERT-IndexReadprimer-index-P7, specifically: 5'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC*T-(sequence of the DNA fragment to be tested)-GTTCGTCTTCTGCCGTATGCTCTA-index-CACTGACCTCAAGTCTGCACACGAGAAGGCTAG-P, where P5 and P7 are adapters, Real1primer and IndexReadprimer are primer sequences, DNAINSERT is the sequence of the DNA fragment to be tested, index is a 12-nt unique sample tag, and the terminal P is a phosphate group.

In one embodiment, both the amplification of the digested double-stranded DNA and the amplification of the captured target fusion cDNA use primers that pair with the sequences of adapters P5 and P7.

A gene fusion variant detection method comprises the following steps:

obtaining sequencing data of a gene fusion variant library, wherein the gene fusion variant library is an amplified library of target fusion genes obtained by using fusion gene capture probes to hybridize to and capture the transcribed sequences of a sample to be tested, the target fusion gene is formed by the fusion of at least two different genes, and the fusion gene capture probes contain a sequence capable of complementary pairing with the sequence of one of the genes of the target fusion gene;

aligning the sequencing data to human transcriptome and genome data, and selecting reads that match at least two genes simultaneously;

analyzing whether the reads that match at least two genes simultaneously meet a preset threshold requirement and, if they do, concluding that the genes covered by those reads have undergone gene fusion.
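The screening step, selecting reads that match at least two genes, can be sketched as a simple grouping operation. The sketch below assumes the aligner output has already been reduced to (read ID, gene name) records, one per aligned segment; that input format and the function name are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def candidate_gene_pairs(
    read_gene_hits: Iterable[Tuple[str, str]]
) -> Dict[Tuple[str, str], Set[str]]:
    """Map each gene pair to the set of read IDs that align to both genes.

    read_gene_hits: (read_id, gene_name) records, one per aligned segment
    (a hypothetical, simplified view of aligner output).
    """
    genes_per_read: Dict[str, Set[str]] = defaultdict(set)
    for read_id, gene in read_gene_hits:
        genes_per_read[read_id].add(gene)

    reads_per_pair: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for read_id, genes in genes_per_read.items():
        # Reads hitting a single gene produce no pair and are skipped.
        for pair in combinations(sorted(genes), 2):
            reads_per_pair[pair].add(read_id)
    return dict(reads_per_pair)
```

Storing read IDs in sets means each read is counted once per gene pair, which is one simple way to obtain a count of unique supporting reads for later comparison with a threshold.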

In one embodiment, before the step of aligning the sequencing data to human transcriptome and genome data and selecting reads that match at least two genes simultaneously, the method further comprises:

assessing the quality of the sequencing data and removing low-quality reads to obtain clean sequencing data.

In one embodiment, removing low-quality reads comprises:

removing reads that contain adapter sequences;

removing reads in which bases with a quality value below 15 account for ≥50% of the read;

removing reads in which N accounts for more than 1% of the bases.
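The three removal rules above can be expressed as a single predicate. The following is a minimal sketch, assuming each read is available as a base string plus a list of per-base Phred quality values and that adapter contamination is detected by plain substring search; real trimming tools use tolerant matching, so the adapter check here is a simplification. The function name and parameters are hypothetical.

```python
from typing import Iterable, Sequence


def is_low_quality(sequence: str,
                   qualities: Sequence[int],
                   adapters: Iterable[str],
                   min_quality: int = 15,
                   max_low_fraction: float = 0.5,
                   max_n_fraction: float = 0.01) -> bool:
    """Return True if the read should be discarded under the three rules."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    if not sequence:
        return True
    # Rule 1: read contains an adapter sequence (exact match only, a simplification).
    if any(adapter in sequence for adapter in adapters):
        return True
    # Rule 2: bases with quality below 15 make up >= 50% of the read.
    low = sum(1 for q in qualities if q < min_quality)
    if low / len(qualities) >= max_low_fraction:
        return True
    # Rule 3: N bases make up more than 1% of the read.
    if sequence.upper().count("N") / len(sequence) > max_n_fraction:
        return True
    return False
```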

In one embodiment, the method further comprises, after the sequencing data have been aligned to human transcriptome and genome data, a step of removing false-positive events from the clean sequencing data according to preset control criteria;

specifically, the gene fusion variant events obtained by screening are annotated to separate true events from false ones, and events that meet the following criteria are removed:

the genes making up the fusion gene are paralogs of one another;

the genes making up the fusion gene are pseudogenes;

the gene fusion variant has already been detected in normal healthy individuals.
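A minimal sketch of this false-positive filter is given below. It assumes that meeting any one of the three criteria is enough to remove an event, and that the pseudogene criterion applies when either partner is a pseudogene; the text does not spell out either point. The annotation sets (paralog pairs, pseudogene names, fusions seen in healthy controls) are hypothetical inputs that the caller would have to supply.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class FusionCandidate:
    gene_a: str
    gene_b: str


def is_false_positive(candidate: FusionCandidate,
                      paralog_pairs: Set[FrozenSet[str]],
                      pseudogenes: Set[str],
                      normal_fusions: Set[FrozenSet[str]]) -> bool:
    """Return True if the candidate meets any of the three removal criteria."""
    pair = frozenset((candidate.gene_a, candidate.gene_b))
    # Criterion 1: the partner genes are paralogs of one another.
    if pair in paralog_pairs:
        return True
    # Criterion 2 (assumption): either partner is a pseudogene.
    if candidate.gene_a in pseudogenes or candidate.gene_b in pseudogenes:
        return True
    # Criterion 3: the fusion has been detected in normal healthy individuals.
    if pair in normal_fusions:
        return True
    return False
```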

In one embodiment, the preset threshold requirement is as follows: if the fusion gene variant is clinically significant, more than 3 unique spanning reads must match both genes simultaneously; if the fusion gene variant is of unknown clinical significance, more than 10 unique spanning reads must match both genes simultaneously.
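The two-tier threshold can be written as a short predicate. This sketch reads "more than 3" and "more than 10" as strict inequalities; the function name is hypothetical, and how a variant is classified as clinically significant is left to the caller.

```python
def passes_threshold(unique_spanning_reads: int, clinically_significant: bool) -> bool:
    """Apply the read-support threshold for calling a gene fusion.

    Clinically significant variants need more than 3 unique spanning reads;
    variants of unknown clinical significance need more than 10.
    """
    minimum_exclusive = 3 if clinically_significant else 10
    return unique_spanning_reads > minimum_exclusive
```

For example, 4 unique spanning reads are enough to call a clinically significant fusion, while a fusion of unknown significance needs at least 11.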

In one embodiment, the method further comprises:

calculating the variation ratio of the fusion gene according to the following formula:

[Formula image not reproduced in this text.]

where:

"fusion supporting read pairs" is the number of read pairs that support the gene fusion;

"#mappable reads" is the number of reads aligned to the genome;

"weighted-average of Insertsize-read length" is the weighted average length of the cDNA fragments inserted in the library;

"refgeneFPKM" is the normalized expression value of the internal reference gene;

FPKM is defined as Fragments Per Kilobase of exon model per Million mapped fragments (the fragment-based counterpart of RPKM, Reads Per Kilobase of exon model per Million mapped reads), that is, the number of fragments aligned to each 1 kb of an exon per million aligned fragments.

A gene fusion variant detection device comprises:

a sequencing data acquisition module, configured to obtain sequencing data of a gene fusion variant library, wherein the gene fusion variant library is an amplified library of target fusion genes obtained by using fusion gene capture probes to hybridize to and capture the transcribed sequences of a sample to be tested, the target fusion gene is formed by the fusion of at least two different genes, and the fusion gene capture probes contain a sequence capable of complementary pairing with the sequence of one of the genes of the target fusion gene;

an alignment and screening module, configured to align the sequencing data to human transcriptome and genome data and select reads that match at least two genes simultaneously; and

a fusion analysis module, configured to analyze whether the reads that match at least two genes simultaneously meet a preset threshold requirement and, if they do, to conclude that the genes covered by those reads have undergone gene fusion.

In one embodiment, the device further comprises:

a variation ratio calculation module, configured to calculate the variation ratio of the fusion gene according to the following formula:

[Formula image not reproduced in this text.]

A computer device has a processor and a memory; the memory stores a computer program, and the processor implements the steps of the gene fusion variant detection method of any of the above embodiments when executing the computer program.

A computer storage medium stores a computer program which, when executed, implements the steps of the gene fusion variant detection method of any of the above embodiments.

A single driver gene can fuse with multiple other genes (partner genes); after the fusion gene is transcribed, a junction (that is, a breakpoint) forms between an exon of the core gene and an exon of the partner gene. The library construction method, detection method and device of the present invention are based on multi-gene targeted RNA sequencing with DNA probe hybridization capture: fusion gene capture probes hybridize to and capture the target fusion genes to construct a gene fusion variant library. The library can be used for high-throughput sequencing, and bioinformatics analysis then identifies the core genes that form the breakpoints and their partner genes.

The library construction method, detection method and device can be used to detect known or novel gene rearrangements, gene deletions, gene duplications and other genetic variations associated with a variety of hotspot fusion genes in hematological tumors. Compared with conventional approaches such as fluorescence quantitative methods, the technical concept of the present invention is more comprehensive and efficient, and is also economical.

Furthermore, the present invention provides a quantitative analysis method for fusion genes, by which the variation ratio of a fusion gene, and in turn an accurate expression value of the fusion gene, can be calculated. This quantitative analysis method is pioneering and addresses the problem of quantitative analysis when fusion genes are detected by NGS.

Brief Description of the Drawings

FIG. 1 is a schematic flowchart of a fusion gene variant detection method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the module structure of a fusion gene variant detection device according to an embodiment of the present invention.

Detailed Description

To facilitate understanding of the present invention, the invention is described more fully below with reference to the accompanying drawings, which show preferred embodiments. The invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that the disclosure of the invention is understood more thoroughly and completely.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field to which the invention belongs. The terms used in the description of the invention are intended only to describe specific embodiments and are not intended to limit the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

A fusion gene, as used herein, is a gene formed when genes at different genomic coordinates are joined through mechanisms such as chromosomal rearrangement and transcribed to produce a new fusion protein. It is written as gene A/gene B or gene A-gene B, for example BCR-ABL1; gene A and gene B are fusion partners of each other.

The selected genes are key core fusion genes. A core gene is a gene that undergoes fusion variation at a relatively high frequency and has been found to have multiple fusion partners, a key gene in a cell growth or proliferation signaling pathway, or a driver gene.

"Reads" are the sequence fragments obtained by high-throughput sequencing.

Sequencing quality refers to the accuracy of the bases in a read sequence.

The "human transcriptome" is the full set of products expressed from all genes in human cells.

The human genome referred to is hg19.

Paralogs are homologous genes within the same species that have diverged following gene duplication; their protein products may evolve new functions related to the original one.

A pseudogene can be regarded as a non-functional copy of genomic DNA whose sequence closely resembles that of a coding gene.

Body Map 2.0 is a set of transcriptome sequencing data from normal human tissues.

"Gene distance" is the distance between the genomic coordinates of two genes.

The present invention provides a method for constructing a gene fusion variant library, which comprises the following steps:

reverse transcribing the rRNA-depleted total RNA to synthesize double-stranded cDNA, with dUTP used in place of dTTP when the second strand of the double-stranded cDNA is synthesized;

performing end repair on the synthesized double-stranded cDNA and ligating adapters;

enzymatically digesting the dUTP in the end-repaired, adapter-ligated double-stranded DNA so that the double-stranded cDNA is nicked;

amplifying the digested double-stranded DNA to construct a cDNA pre-library;

hybridizing fusion gene capture probes to the cDNA pre-library to capture the target fusion cDNA, wherein the target fusion cDNA is formed by the fusion of at least two different genes, and the fusion gene capture probes contain a sequence capable of complementary pairing with the sequence of one of the genes of the target fusion cDNA;

amplifying the captured target fusion cDNA to obtain the gene fusion variant library.

In a specific example, the total RNA of the sample is total RNA from a peripheral blood or bone marrow sample. After the total RNA has been extracted from the sample, the method preferably further comprises measuring the nucleic acid concentration and the A260/A280 value.

In a specific example, the rRNA is removed by hybridizing the total RNA with synthetic single-stranded DNA probes against rRNA; the probes bind the rRNA in the total RNA, which is then removed.

In a specific example, the end repair consists of adding one dATP to the 3' end of the synthesized double-stranded cDNA, and the adapter ligation introduces adapters in the format P5-Real1primer-DNAINSERT-IndexReadprimer-index-P7. Specifically, the adapter sequence is: 5'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC*T-(sequence of the DNA fragment to be tested)-GTTCGTCTTCTGCCGTATGCTCTA-index-CACTGACCTCAAGTCTGCACACGAGAAGGCTAG-P, where P5 (5'-AATGATACGGCGACCACCGA-3', SEQ ID NO:1) and P7 (5'-CAAGCAGAAGACGGCATACGAGAT-3', SEQ ID NO:2) are adapters, Real1primer (GATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, SEQ ID NO:3) and IndexReadprimer (GTTCGTCTTCTGCCGTATGCTCTA, SEQ ID NO:4) are primer sequences, DNAINSERT is the sequence of the DNA fragment to be tested, index is a 12-nt unique sample tag, and the terminal P is a phosphate group.

In a specific example, the fusion gene capture probes are designed according to the following principles: (1) the fusion gene capture probes are designed against the core gene in the target fusion cDNA, where a core gene is a gene that has multiple partner genes and is prone to fusion variation, a key gene in a cell growth or proliferation signaling pathway, or a driver gene;

(3) the fusion gene capture probes are designed against the core genes in the hg19 reference genome, with a coverage density of 2× tiling;

(5) during design, each fusion gene capture probe is aligned to the human transcriptome and all of its BLAST hits are counted; a probe with no more than 50 BLAST hits is acceptable, whereas a probe with more than 50 BLAST hits is redesigned by replacing mismatched bases until it has the best match to the target gene sequence and no more than 50 BLAST hits.

For example, in some specific examples, 54 core genes can be selected for hematological tumors (leukemia and lymphoma), namely ABL1, CREBBP, CRLF2, MECOM, TP53, TSLP, LMO2, PRDM16, MYC, ETV6, RARA, NUP214, BCL6, MYB, IRF4, CBFB, CEBPB, ZNF384, RUNX1, FGFR3, MALT1, ERG, NPM1, PAX5, JAK2, PICALM, FLT3, GLIS2, PDGFRB, PML, TLX1, ITK, FGFR1, IL2RB, TAL1, WT1, NTRK3, NUP98, EPOR, RBM15, CSF1R, KMT2A, BCL2, BCR, LYN, TLX3, CCND1, TCF3, CEBPA, ABL2, ALK, PDGFRA, IGLL5 and IGHA2. Transcript accession numbers are obtained from the Ensembl database, and overlapping 2× tiling probes are designed from the transcript sequences (with the 5' end of each probe labeled with biotin for capture) to produce the probe library.

Furthermore, the 5' end of the fusion gene capture probe is labeled with a linker for capture, for example biotin or streptavidin, which is used to immobilize the probe on a substrate.

In a specific example, both the amplification of the digested double-stranded DNA and the amplification of the captured target fusion cDNA use primers that pair with the sequences of adapters P5 and P7.

As shown in FIG. 1, the present invention also provides a gene fusion variant detection method, which comprises the following steps:

Step S110: obtaining sequencing data of a gene fusion variant library, wherein the gene fusion variant library is an amplified library of target fusion genes obtained by using fusion gene capture probes to hybridize to and capture the transcribed sequences of a sample to be tested, the target fusion gene is formed by the fusion of at least two different genes, and the fusion gene capture probes contain a sequence capable of complementary pairing with the sequence of one of the genes of the target fusion gene;

Step S120: aligning the sequencing data to human transcriptome and genome data, and selecting reads that match at least two genes simultaneously;

Step S130: analyzing whether the reads that match at least two genes simultaneously meet a preset threshold requirement and, if they do, concluding that the genes covered by those reads have undergone gene fusion.

In a specific example, the gene fusion variant library can be sequenced on, but is not limited to, a NovaSeq 6000 high-throughput sequencer, and the sequencing depth can be, but is not limited to, 5000×.

In a specific example, before the step of aligning the sequencing data to human transcriptome and genome data and selecting reads that match at least two genes simultaneously, the method further comprises:

assessing the quality of the sequencing data and removing low-quality reads to obtain clean sequencing data.

Specifically, the raw data can be converted into raw FASTQ files using, for example, the bcl2fastq software; the quality of the raw FASTQ data is assessed with the FastQC software; and low-quality reads can be removed using, for example, the Trimmomatic software to obtain the clean sequencing data.

Furthermore, in a specific example, removing low-quality reads comprises:

removing reads that contain adapter sequences;

In a specific example, the gene fusion variant detection method further comprises, after the sequencing data have been aligned to human transcriptome and genome data, a step of removing false-positive events from the clean sequencing data according to preset control criteria;

the gene fusion variant has already been detected in normal healthy individuals (for example, Body Map 2.0 is a transcriptome data set of normal human tissues, and gene fusion variants detected in that data are judged to be false positives).

Specifically, software such as, but not limited to, BOWTIE, STAR and SPOTLIGHT can be used to align all reads to the human transcriptome and genome and to select reads that match the transcripts of two genes simultaneously. False-positive events are then removed using a series of criteria such as paralogs, pseudogenes, Body Map 2.0 and gene distance. If the number of reads that match a given pair of genes simultaneously exceeds the preset threshold requirement, the two genes are judged to have undergone gene fusion.

More specifically, the preset threshold requirement is as follows: if the fusion gene variant is clinically significant, more than 3 unique spanning reads must match both genes simultaneously (a spanning read is a read aligned across the gene fusion junction); if the fusion gene variant is of unknown clinical significance, more than 10 unique spanning reads must match both genes simultaneously.

进一步，本发明提供的基因融合变异检测方法还包括按照如下公式计算融合基因的变异比例：Further, the gene fusion mutation detection method provided by the present invention also includes calculating the mutation ratio of the fusion gene according to the following formula:

其中，

in,

所述FPKM定义为Reads Per Kilobase of exon model per Million mappedreads，即每1百万(10⁹)个比对上的reads中比对到某外显子的每1K个碱基上的reads个数。The FPKM is defined as Reads Per Kilobase of exon model per Million mappedreads, that is, the number of reads aligned to every 1K bases of an exon in every 1 million (10 ⁹ ) aligned reads.

This is a quantification model for gene transcripts. The value is calculated with the stringtie software and is intended mainly for expression estimates from paired-end sequencing. The difference between FPKM and RPKM is that the former counts fragments and the latter counts reads. For single-end sequencing data, Cufflinks treats each read as one fragment, so FPKM is equivalent to RPKM (RPKM = total exon reads / (mapped reads (millions) × exon length (kb))). For paired-end sequencing, if both reads of a pair are aligned, the pair is counted as one fragment; if only one read of the pair is aligned and the other is not, the aligned read alone is counted as one fragment.
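As a worked illustration of the RPKM expression and the fragment-counting rule described above, the following Python sketch may help. The function names are illustrative and are not part of stringtie or Cufflinks; the sketch only restates the arithmetic given in the text.

```python
from typing import Iterable, Tuple


def rpkm(exon_reads: int, mapped_reads: int, exon_length_bp: int) -> float:
    """RPKM = total exon reads / (mapped reads in millions * exon length in kb)."""
    if mapped_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("mapped_reads and exon_length_bp must be positive")
    return exon_reads / ((mapped_reads / 1e6) * (exon_length_bp / 1e3))


def count_fragments(pairs: Iterable[Tuple[bool, bool]]) -> int:
    """Count fragments from paired-end alignment flags.

    Each item is (mate1_aligned, mate2_aligned). A pair counts as one
    fragment if both mates align, and also if only one mate aligns.
    """
    return sum(1 for mate1, mate2 in pairs if mate1 or mate2)


def fpkm(exon_fragments: int, mapped_fragments: int, exon_length_bp: int) -> float:
    """Same normalization as RPKM, applied to fragment counts."""
    return rpkm(exon_fragments, mapped_fragments, exon_length_bp)
```

For example, 500 reads on a 2 kb exon with 10 million mapped reads give 500 / (10 × 2) = 25.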

Based on the same idea as the above detection method, as shown in FIG. 2, the present invention also provides a gene fusion variant detection device 200, which includes:

a sequencing data acquisition module 210, configured to obtain the sequencing data of a gene fusion variant library, wherein the gene fusion variant library is an amplified library of the target fusion gene obtained by hybridizing and capturing the transcript sequences of the sample to be tested with a fusion gene capture probe, the target fusion gene is formed by the fusion of at least two different genes, and the fusion gene capture probe contains a sequence capable of complementary pairing with the sequence of one of the genes of the target fusion gene;

an alignment screening module 220, configured to align the sequencing data with human transcriptome and genome data and to select reads that can match at least two genes simultaneously; and

a fusion analysis module 230, configured to analyze whether the reads that can match at least two genes simultaneously meet the preset threshold requirement; if they do, the genes covered by those reads have undergone gene fusion.

Optionally, the gene fusion variant detection device 200 further includes:

a variation ratio calculation module 240, configured to calculate the variation ratio of the fusion gene according to the following formula:

where,

Based on the embodiments described above, the present invention also provides a computer device usable for gene fusion variant detection, which has a processor and a memory. The memory stores a computer program, and the processor, when executing the computer program, implements the steps of the gene fusion variant detection method of any of the above embodiments.

Those of ordinary skill in the art will understand that all or part of the processes in the above methods can be carried out by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium; in the embodiments of the present invention, the program can be stored in a storage medium of a computer system and executed by at least one processor in that computer system to implement the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.

Accordingly, the present invention also provides a computer storage medium usable for gene fusion variant detection, on which a computer program is stored; when the computer program is executed, the steps of the gene fusion variant detection method of any of the above embodiments are implemented.

The gene fusion variant library construction method and detection method of the present invention are described in further detail below with reference to a specific library construction and detection example.

1) DNA probe design based on mRNA sequences

In this example, the transcript sequences of fusion genes are captured by DNA probe hybridization and subjected to high-throughput sequencing; bioinformatics analysis then reveals the hotspot or novel fusion forms in which the fusion genes participate.

For hematological tumors (leukemia and lymphoma), 54 core genes were selected: ABL1, CREBBP, CRLF2, MECOM, TP53, TSLP, LMO2, PRDM16, MYC, ETV6, RARA, NUP214, BCL6, MYB, IRF4, CBFB, CEBPB, ZNF384, RUNX1, FGFR3, MALT1, ERG, NPM1, PAX5, JAK2, PICALM, FLT3, GLIS2, PDGFRB, PML, TLX1, ITK, FGFR1, IL2RB, TAL1, WT1, NTRK3, NUP98, EPOR, RBM15, CSF1R, KMT2A, BCL2, BCR, LYN, TLX3, CCND1, TCF3, CEBPA, ABL2, ALK, PDGFRA, IGLL5, and IGHA2. Transcript accession numbers were obtained from the Ensembl database, and overlapping probes with 2× tiling were designed from the transcript sequences (the 5' end of each probe is labeled with biotin for capture), yielding the probe library.
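The overlapping 2× tiling layout can be sketched in Python as follows. This is a minimal illustration, assuming that "2× tiling" means each base is covered by about two probes (step equal to half the probe length) and using the 120 bp probe length stated in the claims. The function name `tile_probes` is hypothetical, and strand selection, biotin labeling, and Blast-based specificity screening are outside the sketch.

```python
from typing import List


def tile_probes(transcript: str, probe_len: int = 120, coverage: int = 2) -> List[str]:
    """Return overlapping probe windows along a transcript sequence.

    Assumption: an N-fold tiling uses a step of probe_len // N, so that
    interior bases are covered by N probes.
    """
    if probe_len <= 0 or coverage <= 0:
        raise ValueError("probe_len and coverage must be positive")
    if len(transcript) <= probe_len:
        # Transcript shorter than one probe: return it as a single window.
        return [transcript]
    step = probe_len // coverage
    last_start = len(transcript) - probe_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        # Add a final window flush with the 3' end so no bases are missed.
        starts.append(last_start)
    return [transcript[s:s + probe_len] for s in starts]
```

With the defaults, a 300-base transcript yields windows starting at positions 0, 60, 120, and 180.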

2) Extraction of total RNA from samples

Total RNA was extracted from peripheral blood or bone marrow samples of leukemia and lymphoma patients using the QIAsymphony RNA Kit (Cat#931636) from QIAGEN. See the manufacturer's instructions for the detailed procedure.

Nucleic acid was quantified by (1) measuring the concentration and the A260/A280 ratio (expected value between 1.9 and 2.1) on a NanoDrop spectrophotometer, and (2) measuring the concentration with the Qubit™ RNA HS Assay Kit (Cat.#Q32855).

3) Ribosomal RNA (rRNA) depletion

Hybridize 500 ng of the total RNA extracted in step 2) with synthetic single-stranded DNA probes against rRNA, and digest the rRNA with RNase H. See the instructions of the NEBNext rRNA Depletion Kit for the detailed procedure. Purify the rRNA-depleted RNA sample with XP beads.

4) Reverse transcription to synthesize cDNA

Incubate at 94°C for 6 minutes in a PCR instrument to fragment the RNA. Use reverse transcriptase to reverse-transcribe the fragmented RNA into single-stranded cDNA.

5) Second-strand cDNA synthesis

Use DNA Polymerase I, Large (Klenow) Fragment to synthesize double-stranded cDNA from the single-stranded cDNA. dUTP is used in place of dTTP here, so the second cDNA strand incorporates dUTP. Purify the double-stranded cDNA with AMPure XP Beads.

6) End repair

Treat the double-stranded cDNA with NEBNext Ultra II End Prep Enzyme Mix, which also adds a dATP to the 3' end.

7) Adapter ligation

Mix ligase, Index adapters containing a 12 nt unique sequence, and the end-repaired cDNA, and incubate at 16°C for 60 minutes in a PCR instrument to obtain the adapter-ligated cDNA library.

Adapter format: P5-Read1primer-DNA INSERT-IndexReadprimer-index-P7.

8) Enzymatic nicking of the second cDNA strand

Add a mix of uracil DNA glycosylase (UDG) and Endonuclease VIII to the above reaction. The two enzymes act together to digest the dUTP in the cDNA library fragments, producing nicks.

9) Library amplification

Amplify the above cDNA library in a PCR instrument using KAPA HiFi HotStart ReadyMix and primers that pair with the adapter P5 and P7 sequences (P5: 5'-AATGATACGGCGACCACCGA-3', SEQ ID NO: 1; P7: 5'-CAAGCAGAAGACGGCATACGAGAT-3', SEQ ID NO: 2). Purify the cDNA pre-library with AMPure XP Beads.

10) Probe capture hybridization

Mix 100 ng of the prepared cDNA library with Universal Blockers-TS Mix and Human Cot-1 DNA, and dry to a powder using a vacuum filtration system (60°C). Then add hybridization buffer and the fusion gene probe library, mix, incubate at 95°C for 30 seconds in a PCR instrument, and hybridize at 65°C for 16-18 hours.

Mix the above reaction with streptavidin magnetic beads (M-270 Streptavidin beads) and incubate at 65°C for 45 min in a PCR instrument, re-mixing every 15 min, to select all transcript sequence fragments containing the fusion genes.

Amplify the hybrid-captured cDNA library in a PCR instrument using KAPA HiFi HotStart ReadyMix and primers that pair with the adapter P5 and P7 sequences (P5: 5'-AATGATACGGCGACCACCGA-3', SEQ ID NO: 1; P7: 5'-CAAGCAGAAGACGGCATACGAGAT-3', SEQ ID NO: 2). Purify the target cDNA library with AMPure XP Beads to obtain the library to be sequenced.

11) Sequencing on the Illumina platform

The library to be sequenced was sequenced on a NovaSeq 6000 high-throughput sequencer at an average sequencing depth of 5000×. See the manufacturer's instructions for the sequencing procedure.

12) Sequencing data analysis

A. Sequencing data preprocessing

The raw data were converted with the bcl2fastq software to obtain raw fastq files, the quality of the raw fastq data was assessed with the FastQC software, and low-quality reads were removed with the Trimmomatic software to obtain clean fastq files.
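The low-quality read criteria recited in claim 9 (reads containing adapter sequence, reads in which bases with quality below 15 account for ≥50%, and reads containing more than 1% N) can be expressed as a small Python predicate. This is a hedged sketch rather than a reproduction of Trimmomatic's behavior; the function name `passes_qc`, its parameters, and the representation of qualities as a list of Phred integers are assumptions for illustration.

```python
from typing import Iterable, Sequence


def passes_qc(
    seq: str,
    quals: Sequence[int],
    adapters: Iterable[str] = (),
    min_q: int = 15,
    max_lowq_frac: float = 0.5,
    max_n_frac: float = 0.01,
) -> bool:
    """Return True if a read survives the three filters from claim 9."""
    if not seq or len(seq) != len(quals):
        return False
    read = seq.upper()
    # 1) Remove reads containing an adapter sequence (exact substring match;
    #    real tools also tolerate mismatches, which this sketch does not).
    if any(a and a.upper() in read for a in adapters):
        return False
    # 2) Remove reads in which bases with quality below min_q make up >= 50%.
    low_quality = sum(1 for q in quals if q < min_q)
    if low_quality / len(read) >= max_lowq_frac:
        return False
    # 3) Remove reads in which N makes up more than 1% of the bases.
    if read.count("N") / len(read) > max_n_frac:
        return False
    return True
```

The two fractional comparisons follow the wording of the claim: the low-quality filter is inclusive (≥50%), while the N filter is strict (more than 1%).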

B. Identification of fusion genes

All reads were aligned to the human transcriptome and genome with the BOWTIE, STAR, and SPOTLIGHT software, and reads matching the transcripts of two genes at the same time were selected. False-positive events were then removed according to a series of criteria, such as paralogs, pseudogenes, Body Map 2.0, and gene distance. If the number of reads simultaneously matching two given genes exceeded the set threshold, those two genes were judged to have undergone gene fusion.
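The false-positive exclusion criteria named above can be sketched as a single predicate in Python. Several details are assumptions, because the text does not specify them: a candidate is rejected if either partner is a pseudogene, the "gene distance" criterion is modeled as rejecting partners on the same chromosome that lie closer than a caller-supplied minimum, and the lookup sets (paralog pairs, pseudogenes, fusions seen in a normal-tissue dataset such as Body Map 2.0) are supplied by the caller. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import AbstractSet, FrozenSet


@dataclass(frozen=True)
class FusionEvent:
    """Hypothetical record for one candidate fusion event."""
    gene_a: str
    gene_b: str
    chrom_a: str
    chrom_b: str
    pos_a: int
    pos_b: int


def pair_key(gene_a: str, gene_b: str) -> FrozenSet[str]:
    # Order-independent key so (A, B) and (B, A) compare equal.
    return frozenset((gene_a, gene_b))


def is_false_positive(
    event: FusionEvent,
    *,
    paralog_pairs: AbstractSet[FrozenSet[str]],
    pseudogenes: AbstractSet[str],
    normal_fusions: AbstractSet[FrozenSet[str]],
    min_gene_distance: int,  # no value is given in the text; caller must choose
) -> bool:
    key = pair_key(event.gene_a, event.gene_b)
    if key in paralog_pairs:
        return True
    # Assumption: one pseudogene partner is enough to reject the event.
    if event.gene_a in pseudogenes or event.gene_b in pseudogenes:
        return True
    if key in normal_fusions:
        return True
    # Assumption: nearby partners on the same chromosome are rejected.
    same_chrom = event.chrom_a == event.chrom_b
    if same_chrom and abs(event.pos_a - event.pos_b) < min_gene_distance:
        return True
    return False
```

Events that pass this predicate would then be subject to the spanning-read threshold before a fusion is reported.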

C. Example results of fusion gene detection data analysis

Applying the present invention, we tested 3 leukemia samples and obtained the following results:

All 3 samples carry fusion genes involving MLL (KMT2A). Probes targeting only the MLL transcript sequence are sufficient to capture the breakpoint sequences of both the MLL gene and its partner gene, so that the specific fusion form can be identified by alignment analysis and fusionFPKM can be calculated as an indicator of its expression level.

The results are shown in Table 1 below.

Table 1

The technical features of the above embodiments can be combined arbitrarily. For brevity, not every possible combination of the technical features in the above embodiments has been described; however, as long as a combination of these technical features contains no contradiction, it should be regarded as falling within the scope of this specification.

The above embodiments represent only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention patent shall be determined by the appended claims.

Sequence Listing

<110> Guangzhou Kingmed Diagnostics Group Co., Ltd.
<120> Gene fusion variant library construction method, detection method, device, equipment and storage medium
<140> 2019114192739
<141> 2019-12-31
<160> 7
<170> SIPOSequenceListing 1.0

<210> 1
<211> 20
<212> DNA
<213> Artificial Sequence
<400> 1
aatgatacgg cgaccaccga 20

<210> 2
<211> 24
<212> DNA
<213> Artificial Sequence
<400> 2
caagcagaag acggcatacg agat 24

<210> 3
<211> 38
<212> DNA
<213> Artificial Sequence
<400> 3
gatctacact ctttccctac acgacgctct tccgatct 38

<210> 4
<211> 24
<212> DNA
<213> Artificial Sequence
<400> 4
gttcgtcttc tgccgtatgc tcta 24

<210> 5
<211> 86
<212> DNA
<213> Artificial Sequence
<400> 5
tccccgccca agtatccctg taaaacaaaa accaaaagaa aagtctgaac aacccagtcc 60
tgccagctcc agctccagct ccagct 86

<210> 6
<211> 86
<212> DNA
<213> Artificial Sequence
<400> 6
tccccgccca agtatccctg taaaacaaaa accaaaagaa aaggaaatga cccattcatg 60
gccgcctcct ttgacagcaa tacata 86

<210> 7
<211> 86
<212> DNA
<213> Artificial Sequence
<400> 7
aattccagca gatggagtcc acaggatcag agtggacttt aaggattctg tttcactgag 60
gccatctatc cgatttcaag gaagcc 86

Claims

1. A method for constructing a gene fusion variant library, characterized by comprising the following steps:

extracting total RNA from a sample and removing the rRNA;

reverse-transcribing the rRNA-depleted total RNA to synthesize double-stranded cDNA, wherein dUTP is used in place of dTTP when synthesizing the second strand of the double-stranded cDNA;

performing end repair on the synthesized double-stranded cDNA and ligating adapters;

enzymatically digesting the dUTP in the double-stranded DNA after end repair and adapter ligation, so that the double-stranded cDNA is nicked;

amplifying the enzymatically digested double-stranded DNA to construct a cDNA pre-library;

hybridizing and capturing the target fusion cDNA in the cDNA pre-library with a fusion gene capture probe, wherein the target fusion cDNA is formed by the fusion of at least two different genes, and the fusion gene capture probe contains a sequence capable of complementary pairing with the sequence of one of the genes of the target fusion cDNA; and

amplifying the captured target fusion cDNA to obtain the gene fusion variant library.

2. The method for constructing a gene fusion variant library according to claim 1, characterized in that the design principles of the fusion gene capture probe are as follows:

(1) the fusion gene capture probe is designed for the core gene in the target fusion cDNA, wherein the core gene is a gene that has multiple partner genes and is prone to fusion variation, or a key gene in a cell growth or proliferation signaling pathway, or a driver gene;

(2) the fusion gene capture probe is designed for the transcript sequence of the core gene;

(3) the fusion gene capture probe is designed for the core gene in the hg19 reference genome, with a coverage density of 2× tiling;

(4) the length of the fusion gene capture probe is 120 bp;

(5) during design, the fusion gene capture probe is aligned against the human transcriptome sequence and the total number of Blast matches is counted; a probe with no more than 50 Blast matches is qualified; a probe with more than 50 Blast matches is redesigned by substituting mismatched bases until it has the highest match to the target gene sequence and no more than 50 Blast matches.

3. The method for constructing a gene fusion variant library according to claim 1 or 2, characterized in that the 5' end of the fusion gene capture probe is labeled with a linker used for capture;

optionally, the linker is biotin or streptavidin.

4. The method for constructing a gene fusion variant library according to claim 1 or 2, wherein the total RNA of the sample is the total RNA of peripheral blood or bone marrow samples.

5. The method for constructing a gene fusion variant library according to claim 1 or 2, wherein the end repair adds a dATP to the 3' end of the synthesized double-stranded cDNA;

the adapter format introduced by the ligated adapter is P5-Read1primer-DNAINSERT-IndexReadprimer-index-P7, specifically: 5'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC*T-DNA fragment sequence to be tested-GTTCGTCTTCTGCCGTATGCTCTA-index-CACTGACCTCAAGTCTGCACACGAGAAGGCTAG-P, wherein P5 and P7 are the adapter sequences, Read1primer and IndexReadprimer are the primer sequences, DNAINSERT is the sequence of the DNA fragment to be tested, index is the unique 12 nt sample tag, and P is a phosphate group.

6. The method for constructing a gene fusion variant library according to claim 5, wherein both the amplification of the enzymatically digested double-stranded DNA and the amplification of the captured target fusion cDNA are performed with primers that pair with the P5 and P7 sequences of the adapter.

7. A gene fusion variant detection method, characterized by comprising the following steps:

obtaining the sequencing data of a gene fusion variant library, wherein the gene fusion variant library is an amplified library of the target fusion gene obtained by hybridizing and capturing the transcript sequences of the sample to be tested with a fusion gene capture probe, the target fusion gene is formed by the fusion of at least two different genes, and the fusion gene capture probe contains a sequence capable of complementary pairing with the sequence of one of the genes of the target fusion gene;

aligning the sequencing data with human transcriptome and genome data, and selecting reads that can match at least two genes simultaneously;

analyzing whether the reads that can match at least two genes simultaneously meet the preset threshold requirement; if they do, the genes covered by those reads have undergone gene fusion.

8. The gene fusion variant detection method according to claim 7, characterized in that, before the step of aligning the sequencing data with human transcriptome and genome data and selecting reads that can match at least two genes simultaneously, the method further includes:

evaluating the quality of the sequencing data and removing low-quality reads to obtain clean sequencing data.

9. The gene fusion variant detection method according to claim 8, characterized in that removing low-quality reads comprises:

removing reads that contain adapter sequences;

removing reads in which low-quality bases (quality value below 15) account for ≥50%;

removing reads in which N accounts for more than 1%.

10. The gene fusion variant detection method according to claim 9, further comprising, after aligning the sequencing data with human transcriptome and genome data, a step of removing false-positive events from the clean sequencing data according to preset control criteria;

specifically, the gene fusion variant events obtained by screening are annotated, false events are discarded and true events retained, and gene fusion variant events meeting the following criteria are removed:

the different genes of the fusion gene are paralogs of each other;

the different genes of the fusion gene are pseudogenes;

the gene fusion variant has been detected in normal healthy individuals.

11. The gene fusion variant detection method according to claim 10, characterized in that the preset threshold requirement is as follows: if the fusion gene variant is clinically significant, more than 3 unique spanning reads match the two genes simultaneously; if the fusion gene variant is of unknown clinical significance, more than 10 unique spanning reads match the two genes simultaneously.

12. The gene fusion variant detection method according to any one of claims 7 to 11, characterized by further comprising:

calculating the variation ratio of the fusion gene according to the following formula:

where,

fusion supporting read pairs is the number of read pairs supporting the gene fusion;

#mappable reads is the number of reads aligned to the genome;

weighted-average of Insertsize-read length is the weighted average length of the cDNA fragments inserted into the library;

refgeneFPKM is the normalized expression value of the internal reference gene;

FPKM stands for Fragments Per Kilobase of exon model per Million mapped reads, that is, the number of fragments aligned to each 1 kb of an exon per 1 million aligned reads.

13. A gene fusion variant detection device, characterized by comprising:

a sequencing data acquisition module, configured to obtain the sequencing data of a gene fusion variant library, wherein the gene fusion variant library is an amplified library of the target fusion gene obtained by hybridizing and capturing the transcript sequences of the sample to be tested with a fusion gene capture probe, the target fusion gene is formed by the fusion of at least two different genes, and the fusion gene capture probe contains a sequence capable of complementary pairing with the sequence of one of the genes of the target fusion gene;

an alignment screening module, configured to align the sequencing data with human transcriptome and genome data and to select reads that can match at least two genes simultaneously; and

a fusion analysis module, configured to analyze whether the reads that can match at least two genes simultaneously meet the preset threshold requirement.

14. The gene fusion variant detection device according to claim 13, further comprising:

a variation ratio calculation module, configured to calculate the variation ratio of the fusion gene according to the following formula:

where,

weighted-average of Insertsize-read length is the weighted average length of the cDNA fragments inserted into the library.

15. A computer device, comprising a processor and a memory, wherein the memory stores a computer program, characterized in that the processor, when executing the computer program, implements the steps of the gene fusion variant detection method according to any one of claims 7 to 12.

16. A computer storage medium on which a computer program is stored, characterized in that, when the computer program is executed, the steps of the gene fusion variant detection method according to any one of claims 7 to 12 are implemented.