CN111326212A - A method for detecting structural variation

Info

Publication number: CN111326212A
Application number: CN202010098320.0A
Authority: CN
Inventors: 伍林军; 白健; 茹兰兰; 郑璐
Original assignee: Berry Oncology Co Ltd
Current assignee: Berry Oncology Co Ltd
Priority date: 2020-02-18
Filing date: 2020-02-18
Publication date: 2020-06-23
Anticipated expiration: 2040-02-18
Also published as: CN111326212B

Abstract

The invention discloses a method for detecting structural variation, which simultaneously supports the identification of the structural variation of RNA and DNA, and has the advantages of high sensitivity, high specificity, high speed and low resource consumption. The invention also provides a complete system or device, a computer readable storage medium and equipment established based on the method.

Description

A method for detecting structural variation

Technical Field

The invention belongs to the technical field of genetic testing, and in particular relates to a method for detecting structural variation and to a related system, device, computer-readable storage medium, and equipment.

Background Art

Structural variations that originate within the genome include deletions, inversions, and duplications within a single chromosome, as well as abnormal joins between different chromosomes. Whatever event causes it, the result is often that different parts of two genes become physically joined; after transcription, a new transcript made up of parts of the transcripts of the two genes can be observed at the transcript level. These structurally varied genes are of great scientific significance in the onset and progression of cancer, and are of great medical value for studying the mechanisms of tumor onset and progression and for tumor treatment and monitoring. For example, BCR->ABL is widespread in hematological tumors and in bladder cancer, lung cancer, malignant glioma, and other tumors; FGFR3->TACC3 occurs mainly in bladder cancer, cervical squamous cell carcinoma, and cervical adenocarcinoma; and EML4->ALK occurs mainly in lung cancer.

Structural variation detection based on next-generation sequencing has been available for a long time. It works mainly by sequencing target regions or the whole genome and analyzing the resulting sequences to determine whether a structural variation has occurred.

DNA-level detection works mainly by aligning the sequencing data to the genome and checking whether aligned reads show split alignment, that is, whether two parts of a read align to different positions in the genome, in order to collect evidence that may support a structural variation event. If split alignment has occurred, the two parts of the split-aligned read are analyzed further: the cause of the structural variation is inferred, and its outcome calculated, from the alignment positions and strand orientations of the two parts. With paired-end sequencing, one template yields two reads, so abnormal read pairs that support a structural variation can also be collected by checking whether the alignments of the two paired reads are abnormal. Normally, one read of a pair aligns to the forward strand of the genome and the other to the reverse strand, their orientations are consistent when viewed along the transcription direction of the nucleic acid, and the insert length falls within a reasonable distribution range. If the two reads come from the two parts of a structurally varied gene, either the orientation or the implied insert length will be abnormal. However, many of the methods published to date suffer from long computation times, low sensitivity, high false-positive rates, and the lack of an annotation module.

RNA-level detection often requires alignment against two sets of reference sequences, the genome and the transcriptome: reads are aligned to the transcriptome, the coordinates are mapped to the genome, and then, from the alignment features of the reads and following the decision approach used for DNA, the mechanism of the structural variation event is inferred and the type of variation and the genes involved are calculated. This approach not only consumes excessive resources but is also subject to intron interference, which often makes the calculation inaccurate and produces many false positives and false negatives. The results also often lack annotation, which makes them very inconvenient to use.

Summary of the Invention

One object of the present invention is to address the shortcomings of the prior art by providing a structural variation detection method that supports the identification of structural variations in both RNA and DNA, with high sensitivity, high specificity, high speed, and low resource consumption.

To achieve the above object, the present invention provides a method for detecting structural variation, comprising the following steps:

1) Align the sequencing data to a reference genome sequence or a reference main transcript sequence;

2) Identify normally aligned reads, split-aligned reads, and discordantly aligned read pairs;

3) Classify the split-aligned reads and the discordantly aligned read pairs;

4) Group the split-aligned reads and the discordantly aligned read pairs of each category separately, placing reads that support the same structural variation event in the same set;

5) For each structural variation event determined from split-aligned reads, assemble the reads supporting that event into a consensus sequence;

6) Determine the precise breakpoint positions from the consensus sequence;

7) Merge the structural variations supported by split-aligned reads and, separately, those supported by discordantly aligned read pairs, where merging means combining structural variation events of the same type with nearby breakpoints into a single event;

8) Merge structural variations supported by split-aligned reads with structural variations supported by discordantly aligned read pairs when they are of the same type and have nearby breakpoints;

9) Remove structural variation events whose consensus sequence can be matched completely and contiguously to a single stretch of the genome, or can be consistently aligned at multiple locations;

10) Calculate the frequency of each structural variation event.

In a specific embodiment, in step 1), sequencing data that are DNA data are aligned to the reference genome sequence, and sequencing data that are RNA data are aligned to the reference main transcript sequence.

In a specific embodiment, step 2) further comprises using the normally aligned reads to collect insert lengths and compute the main parameters of the insert length distribution; the main parameters are preferably the maximum, the minimum, and/or the mean.
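The insert length statistics described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the tuple layout `(start1, end1, start2, end2)` are assumptions, and the insert length follows the definition given later in this description (the absolute difference between the smallest start and the largest end of a read pair).

```python
from statistics import mean

def insert_length(start1, end1, start2, end2):
    """Template length of one pair: largest end minus smallest start."""
    return abs(max(end1, end2) - min(start1, start2))

def insert_length_summary(pairs):
    """Return (minimum, maximum, mean) insert length over normally aligned pairs.

    `pairs` is an iterable of (start1, end1, start2, end2) tuples; this layout
    is a hypothetical choice made for the sketch.
    """
    lengths = [insert_length(*pair) for pair in pairs]
    if not lengths:
        raise ValueError("no normally aligned pairs supplied")
    return min(lengths), max(lengths), mean(lengths)
```

The maximum returned here would play the role of the "maximum insert size" used by the later clustering and merging steps.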

In a specific embodiment, in step 3), split-aligned reads can be classified on the basis of the following criteria:

whether the two parts of the split alignment align to the same chromosome, whether they align in different orientations on the genome, and/or whether both clipping positions lie upstream of the aligned positions.

In a specific embodiment, in step 3), split-aligned reads can be classified according to the criteria in the following table:

In a specific embodiment, in step 3), discordantly aligned read pairs can be classified on the basis of the following criteria:

whether the reads align to the same chromosome, whether they align in different orientations on the genome, and/or the insert size.

In a specific embodiment, in step 3), discordantly aligned read pairs can be classified according to the criteria in the following table:

In a specific embodiment, the SA tag can be used to find split-aligned reads and/or discordantly aligned read pairs.
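For orientation, the SA tag of the SAM/BAM format lists the other alignments of a chimeric read as semicolon-terminated entries of the form `rname,pos,strand,CIGAR,mapQ,NM`. The sketch below parses such a value so that both parts of a split alignment can be read from a single record. The class and function names are assumptions made for illustration; the patent does not disclose its own parser.

```python
from typing import NamedTuple

class SupplementaryAlignment(NamedTuple):
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost position, as in the SAM format
    strand: str  # "+" or "-"
    cigar: str
    mapq: int
    nm: int      # edit distance

def parse_sa_tag(value):
    """Parse the value of an SA:Z tag into a list of alignments."""
    records = []
    for entry in value.split(";"):
        if not entry:
            continue  # the tag value ends with a trailing semicolon
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        records.append(
            SupplementaryAlignment(rname, int(pos), strand, cigar, int(mapq), int(nm))
        )
    return records
```

Because the tag travels with each alignment record, a read can be classified on first sight without waiting for its other records.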

In a specific embodiment, when split-aligned reads are identified, a read is not counted if its split-aligned part also aligns elsewhere.

In a specific embodiment, when split-aligned reads are identified, the split alignment records that originate from the same aligned read can be treated as a single entity.

In a specific embodiment, in step 4), the grouping can be performed by cluster analysis.

Preferably, the criteria for clustering split-aligned reads include the structural variation type, the names of the reference sequences to which the two parts of the split-aligned read align, the position of the first breakpoint, and/or the position of the second breakpoint.

Further, reads with the same structural variation type and the same aligned reference sequence names, and whose first and second breakpoint positions lie within m bases of one another, are placed in one class, where m is a natural number not exceeding 30, preferably 10.

Further, if the number of supporting reads in a class is higher than a preset threshold (a natural number of 1 or more), the breakpoint positions of all reads in the class are averaged to obtain the mean breakpoint position.
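The clustering rule for split-aligned reads can be sketched as below. This is one possible reading, not the patent's code: the tuple layout, the function name, and the choice to measure the m-base distance against the first member of each class are all assumptions. The sketch returns every class with its support count and leaves the comparison against the preset threshold to the caller.

```python
from collections import defaultdict

def cluster_split_reads(reads, m=10):
    """Group split-aligned reads and average their breakpoints.

    `reads` is an iterable of (sv_type, ref1, ref2, bp1, bp2) tuples
    (a hypothetical layout). Reads join a class when the type and reference
    names match and both breakpoints lie within `m` bases of the class's
    first member (an assumption of this sketch).
    """
    by_key = defaultdict(list)
    for sv_type, ref1, ref2, bp1, bp2 in reads:
        by_key[(sv_type, ref1, ref2)].append((bp1, bp2))

    clusters = []
    for key, points in by_key.items():
        groups = []
        for bp1, bp2 in sorted(points):
            for group in groups:
                anchor1, anchor2 = group[0]
                if abs(bp1 - anchor1) <= m and abs(bp2 - anchor2) <= m:
                    group.append((bp1, bp2))
                    break
            else:
                groups.append([(bp1, bp2)])
        for group in groups:
            clusters.append({
                "key": key,
                "support": len(group),
                "mean_bp1": sum(p[0] for p in group) / len(group),
                "mean_bp2": sum(p[1] for p in group) / len(group),
            })
    return clusters
```

No graph or tree structure is needed: a dictionary lookup followed by a linear scan over sorted breakpoints is enough for this rule.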

Preferably, the criteria for clustering discordantly aligned read pairs include the structural variation type, the name of the reference sequence to which the read aligns, the name of the reference sequence to which its mate aligns, the alignment position of the read, and/or the alignment position of its mate.

Further, reads with the same aligned reference sequence names whose alignment positions differ by no more than the maximum insert size are placed in one class.

Further, if the number of supporting reads in a class is higher than a preset threshold (a natural number of 1 or more, preferably 2), the breakpoint range defined by the read pairs in the class is used to estimate the breakpoint position; preferably, the interval containing the breakpoint is narrowed progressively according to the alignment start and end positions of the read pairs.
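The progressive narrowing is not spelled out in the text, so the following is only a sketch under stated assumptions. It assumes a deletion-like signature in which each supporting pair has a forward-strand read to the left of the breakpoint, so the breakpoint cannot lie before that read's end and cannot lie further than the maximum insert size from that read's start. The function name and input layout are hypothetical.

```python
def narrow_breakpoint_interval(left_reads, max_insert):
    """Shrink the interval that must contain the left breakpoint.

    `left_reads` is an iterable of (start, end) alignment coordinates of the
    reads lying to the left of the breakpoint. Each pair contributes a lower
    bound (its end) and an upper bound (its start plus `max_insert`); the
    interval is the intersection of all of them.
    """
    low = high = None
    for start, end in left_reads:
        pair_low, pair_high = end, start + max_insert
        low = pair_low if low is None else max(low, pair_low)
        high = pair_high if high is None else min(high, pair_high)
    if low is None:
        raise ValueError("no supporting read pairs supplied")
    return low, high
```

Each additional pair can only tighten the interval, which is why more supporting pairs give a better estimate even without a read that crosses the breakpoint.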

In a specific embodiment, in step 5), the assembly can be performed by multiple sequence alignment.

In a specific embodiment, step 5) further comprises parsing and reconstructing the sequences that take part in the assembly, which includes extracting short inserted sequences near the breakpoint and/or reorienting each sequence so that the 5' end of the read agrees with the direction of the reference sequence.

In a specific embodiment, in step 6), the consensus sequence is aligned to a reference breakpoint sequence, and the precise breakpoint position is determined from the position at which the split alignment occurs. The reference breakpoint sequence consists of two parts: one taken from the reference sequence spanning one breakpoint of the structural variation event, and the other taken from the reference sequence spanning the other breakpoint.

In a specific embodiment, in step 7), the structural variation event with the higher support count is retained when events are merged.

In a specific embodiment, in step 7), structural variation events supported by split-aligned reads are merged when they are of the same type and their breakpoints are separated by no more than the homologous sequence length of either event, or by fewer than n bases, where n is a natural number not exceeding 30, preferably 10.

In a specific embodiment, in step 7), structural variation events supported by discordantly aligned read pairs are merged when they are of the same type and their breakpoints are separated by no more than the maximum insert length.

In a specific embodiment, in step 8), a structural variation event supported by split-aligned reads is merged with an event of the same type supported by discordantly aligned read pairs when their breakpoints are separated by no more than the maximum insert length.
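The merging conditions of steps 7) and 8) reduce to simple predicates. The sketch below is illustrative only: the `SvEvent` record, its field names, and the requirement that both events share the same reference names are assumptions introduced here, and the breakpoint distance is taken as the larger of the two per-breakpoint differences.

```python
from dataclasses import dataclass

@dataclass
class SvEvent:
    sv_type: str
    chrom1: str
    bp1: int
    chrom2: str
    bp2: int
    support: int
    homology_len: int = 0  # length of homologous sequence at the junction

def _comparable(a, b):
    """Same type on the same reference sequences (the latter is assumed)."""
    return (a.sv_type, a.chrom1, a.chrom2) == (b.sv_type, b.chrom1, b.chrom2)

def _breakpoint_distance(a, b):
    return max(abs(a.bp1 - b.bp1), abs(a.bp2 - b.bp2))

def can_merge_split(a, b, n=10):
    """Step 7 condition for two events supported by split-aligned reads."""
    if not _comparable(a, b):
        return False
    d = _breakpoint_distance(a, b)
    return d <= a.homology_len or d <= b.homology_len or d < n

def can_merge_by_insert(a, b, max_insert):
    """Condition used for discordant-pair events (step 7) and for merging a
    split-read event with a discordant-pair event (step 8)."""
    return _comparable(a, b) and _breakpoint_distance(a, b) <= max_insert

def keep_better_supported(a, b):
    """On merging, retain the event with the higher support count."""
    return a if a.support >= b.support else b
```

The tighter tolerance for split-read events reflects their base-level breakpoints, whereas discordant pairs only localize a breakpoint to within roughly one insert length.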

In a specific embodiment, the alignment in step 9) can be performed using a BWT (Burrows-Wheeler transform) algorithm.

In a specific embodiment, in step 9), if the consensus sequence aligns completely to one continuous region of the reference genome sequence or reference main transcript sequence, the event is flagged and/or filtered out. If the consensus sequence can be split-aligned to the reference genome sequence or reference main transcript sequence, the breakpoint positions given by the split alignment differ from the previously calculated breakpoint positions by no more than 10 bp, and there are more than 2 such split alignments, the event is flagged and/or filtered out. If there is exactly 1 such split alignment, the breakpoint is refined according to its breakpoint position. If there are fewer than 1 such split alignments, the event is neither flagged nor modified.
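The decision rule of step 9) can be written as a small function. The function name and the returned labels are hypothetical. The text does not say what happens when exactly two matching split alignments are found, so the sketch reports that case as unspecified instead of guessing.

```python
def consensus_realignment_action(aligns_contiguously, matching_split_alignments):
    """Decide what to do with an event after realigning its consensus sequence.

    `aligns_contiguously`: the consensus maps completely to one continuous
    reference region. `matching_split_alignments`: number of split alignments
    whose breakpoints fall within 10 bp of the previously calculated ones.
    """
    if aligns_contiguously:
        return "flag_or_filter"
    if matching_split_alignments > 2:
        return "flag_or_filter"
    if matching_split_alignments == 1:
        return "refine_breakpoint"
    if matching_split_alignments < 1:
        return "keep_unchanged"
    return "unspecified"  # exactly 2: not covered by the description
```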

In a specific embodiment, in step 10), the frequency of a structural variation event is calculated from the reference-type counts and variant-type counts at the two breakpoints of the event, where the reference type is the type consistent with the reference genome sequence or reference main transcript sequence, and the variant type is the type consistent with the sequence of the structural variation event. Preferably, the higher of the reference-type support counts on the two sides is used as the reference-type count of the event.

Further, structural variation event frequency = number of molecules supporting the structural variation / (number of molecules supporting the structural variation + number of molecules supporting the reference type).

Preferably, if both reads of the same template support a given structural variation event, they are counted as one molecule.
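A minimal sketch of the frequency calculation follows. It assumes each supporting read carries a template identifier, so that two reads from the same template collapse into one molecule; the function and parameter names are hypothetical.

```python
def sv_frequency(variant_template_ids, ref_template_ids_bp1, ref_template_ids_bp2):
    """Frequency = variant molecules / (variant molecules + reference molecules).

    Molecules are counted per template, so duplicate identifiers collapse.
    The reference-type count is the higher of the counts at the two
    breakpoints, following the preferred option in the text.
    """
    variant = len(set(variant_template_ids))
    reference = max(len(set(ref_template_ids_bp1)), len(set(ref_template_ids_bp2)))
    total = variant + reference
    if total == 0:
        raise ValueError("no molecules cover either breakpoint")
    return variant / total
```

For example, two variant templates against six reference templates at the better-covered breakpoint give 2 / (2 + 6) = 0.25.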

The detection method of the present invention may further comprise fusion gene annotation and/or output of the detection results. Preferably, the annotated items include, but are not limited to, the fusion gene name, the fusion gene breakpoint positions, exon and intron positions, whether the junction is correct, the structural variation sequence, and/or the support counts for each class of structural variation. Preferably, the results are output in tab-delimited format or compressed binary format.

Structural variation detection software in the prior art usually has no fusion gene annotation module and must rely on third-party annotation software. Third-party annotation software often cannot make full use of the data produced by the structural variation caller, so annotation is often incomplete or fails altogether. To solve this problem, software developed on the basis of the method of the present invention can include a built-in annotation module that annotates the structurally varied genes directly once the structural variation events have been calculated.

To allow customized output, various criteria can be set for the scope of the structural variation report, such as limits on the support count, on the reported range, or on the direction of the structural variation. The report format falls into two categories: a tab-delimited format, which is convenient for human reading, and a compressed binary format, which is convenient for computer processing. A companion tool module can also be provided to mine the results further, performing operations such as filtering existing results, regenerating reports, and merging supporting background pools.

The information output can be selected as needed and includes, but is not limited to: the fusion genes associated with the structural variation, the number of molecules supporting the structural variation, the number of molecules supporting the reference type, the structural variation frequency, whether the structural variation appears in COSMIC or ONCOKB, the fusion gene status, the distance between the two genes involved in the fusion, chromosome information, breakpoint positions, structural variation type, structural variation size, inserted sequence near the breakpoint, fusion mask, and/or consensus sequence near the breakpoint.

The detection method of the present invention may further comprise a step of sequencing and/or obtaining sequencing data.

The technical solution of the present invention can be used in a variety of diagnostic and non-diagnostic cancer applications. It is applicable to tumors of any stage, for example very early, early, intermediate, and advanced tumors, and is preferably used for early or very early tumors.

Another object of the present invention is to provide a system or device for detecting structural variation, comprising:

1) a sequencing module and/or a sequencing data acquisition module; and

2) an identification module for performing the structural variation detection method of the present invention on the sequencing data.

In a specific embodiment, the system or device further comprises:

3) an annotation module for annotating fusion genes; and/or

4) an output module for outputting the detection results.

The present invention also provides a computer-readable storage medium comprising a stored computer program, the computer program containing a program for performing the structural variation detection method of the present invention.

The present invention also provides equipment comprising a processor, a memory, and a computer program stored in the memory, the computer program containing a program for performing the structural variation detection method of the present invention.

The beneficial effects of the present invention include at least the following:

(1) The detection method of the present invention supports the identification of structural variations in both RNA and DNA, with very high sensitivity, high specificity, high speed, and low resource consumption. A single read or read pair supporting the variation is enough for detection, whereas the prior art usually requires at least 3 reads or 3 pairs. This sensitivity is a qualitative improvement, because in practical sequencing applications, especially for low-frequency, highly heterogeneous tumor samples, it is often difficult to obtain enough (at least 3) sufficiently long reads that happen to span the breakpoint, as prior-art methods require.

(2) For structural variation calling from RNA sequencing, the present invention adopts a novel approach based on alignment to main transcripts, which completely avoids the influence of introns on alignment and calculation and achieves excellent results.

(3) With the method of the present invention, classifying split-aligned reads and discordantly aligned read pairs requires visiting each read only once and does not depend on the read's other alignment records. The requirements on alignment records are also stricter: if the split-aligned part also aligns elsewhere, the region is treated as a repeat and the read is not counted.

(4) When classifying reads, prior-art methods often sort by split position, which easily causes false negatives and false positives; the method of the present invention effectively avoids this problem.

(5) With the method of the present invention, grouping the different categories of split-aligned reads and discordantly aligned read pairs does not require a complex computational model, which simplifies the calculation. Prior-art methods build complex nodes and trees and cluster them when determining breakpoints, which is computationally expensive and prone to false positives.

(6) Assembling the sequences with the method of the present invention effectively improves the accuracy of breakpoint determination.

(7) The present invention uses the SA tag to find split-aligned reads and/or discordantly aligned read pairs, so each read needs to be visited only once for classification, whereas the prior art must obtain both alignment records of a read before it can calculate. The method of the present invention thus saves both computation time and storage space for the alignment file.

(8) The method of the present invention places stricter requirements on alignment records: when split-aligned reads are identified, a read whose split-aligned part also aligns elsewhere is treated as lying in a repeat region and is not counted.

(9) When split-aligned reads are identified, the split alignment records that originate from the same aligned read are treated as a single entity. Prior-art methods often sort by split position without distinguishing whether the records come from the same read. The method of the present invention helps avoid false negatives and false positives.

Brief Description of the Drawings

FIG. 1 is a flowchart of the structural variation detection method of the present invention.

Detailed Description

Unless otherwise specified, the terms used in the present invention have their usual meanings in the art, and the reagents used are conventional commercially available reagents.

The term "reference sequence" in the present invention refers to a sequence in the DNA sequence file of a reference genome or in the RNA sequence file of a transcriptome.

The term "normally aligned read" in the present invention refers to a read that, when aligned to the genome, aligns contiguously to one continuous region of the genome. For a read of length m, the read aligns to a continuous closed interval [a,b] on the genome; mismatches and short insertions and deletions are allowed, but no clipping occurs at the start or end of the read.

The term "split-aligned read" in the present invention refers to a read whose two parts, when aligned to the genome, align to non-contiguous regions. For a read of length m and a split point n with 10<n<m, read segment [0,n] aligns normally to genome interval [a,b] and read segment [n+1,m] aligns normally to genome interval [c,d], where [a,b] and [c,d] either lie on different chromosomes or are separated by a distance greater than 100.
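The interval condition in this definition can be checked with a few lines of code. The sketch below is illustrative: the function name is hypothetical, and measuring the distance as the gap between the two closed intervals is an assumption, since the text does not say between which coordinates the distance is taken.

```python
def is_split_alignment(chrom1, a, b, chrom2, c, d, min_distance=100):
    """Return True if [a,b] on chrom1 and [c,d] on chrom2 satisfy the definition.

    The two intervals must lie on different chromosomes or be more than
    `min_distance` apart. The gap is negative when the intervals overlap.
    """
    if chrom1 != chrom2:
        return True
    gap = max(a, c) - min(b, d)
    return gap > min_distance
```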

The term "discordantly aligned read pair" in the present invention refers to a pair of reads obtained by paired-end sequencing of one template whose separate alignments are not consistent with both reads coming from the same normal template, where a normal template is a continuous region [a,b] of the genome.

The term "insert sequence" in the present invention has different meanings in different contexts. When the template length distribution is calculated, the insert sequence of a read refers to the length of the template, that is, the absolute value of the difference between the smallest start position and the largest end position of the aligned read pair. When structural variation is calculated, it refers to an extra stretch of sequence present at a given site in the sample relative to the reference genome.

The terms "maximum insert" and "maximum insert sequence" in the present invention refer to the longest insert sequence (template).

The term "support" in the present invention denotes a causal relationship: "A supports B" means that the observation A indicates that B has occurred.

The term "structural variation event" in the present invention refers to a large change in the genome of a sample relative to the normal reference genome sequence, specifically including inversion, insertion, deletion, duplication, and transposition.

The term "consensus sequence" (保守序列, literally "conserved sequence") in the present invention is defined as follows: given a set of sequences, a multiple sequence alignment is performed, and at each position the base with the highest support, provided that its support is greater than three sequences, is taken as the consensus base for that position; joining all consensus bases in their relative order in the original sequences gives the consensus sequence.
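The column-wise rule in this definition can be sketched as below. The sketch assumes the multiple sequence alignment has already been computed and is supplied as equal-length rows with `-` for gaps; the function name and the gap symbol are assumptions, and ties between equally supported bases are broken by first occurrence, which the text does not specify.

```python
from collections import Counter

def consensus_from_alignment(rows, min_support=3, gap="-"):
    """Build a consensus from pre-aligned, equal-length sequences.

    For each column the most frequent non-gap base is kept if its support is
    greater than `min_support`; columns without such a base are skipped.
    """
    if len({len(row) for row in rows}) > 1:
        raise ValueError("aligned rows must all have the same length")
    consensus = []
    for column in zip(*rows):
        counts = Counter(base for base in column if base != gap)
        if not counts:
            continue  # column consists only of gaps
        base, support = counts.most_common(1)[0]
        if support > min_support:
            consensus.append(base)
    return "".join(consensus)
```

A single discordant base, such as a sequencing error in one read, is outvoted at its column and does not reach the consensus.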

本发明中术语“主转录本”是指人为选择的，同一个基因的数个转录本中，最长的，被研究较为透彻的，最具有临床价值的转录本。In the present invention, the term "main transcript" refers to the artificially selected, among several transcripts of the same gene, the longest, the most thoroughly studied and the most clinically valuable transcript.

本发明中术语“断裂点位置”是指结构变异事件的起始和终止位置，该位置是基因组或者转录组上的一个位置。The term "breakpoint position" in the present invention refers to the start and end position of a structural variation event, and the position is a position on the genome or transcriptome.

本发明中术语“参考型”是指和参考基因组或参考转录本一样的基因型。The term "reference type" in the present invention refers to the same genotype as the reference genome or reference transcript.

本发明中术语“变异型”是指和结构变异事件一样的变异类型。The term "variant type" in the present invention refers to the same variant type as a structural variant event.

In the present invention, the term "molecular counting" refers to counting by the number of templates.

In the present invention, the term "template" refers to a DNA molecule of the original sample that is used for sequencing.

In the present invention, the terms "smaller alignment position" and "larger alignment position" are a pair of relative concepts that describe where a read aligns on the reference genome sequence or the reference main transcript sequence: a position nearer the 5' end is smaller, and a position nearer the 3' end is larger. Likewise, the terms "smaller coordinate" and "larger coordinate" refer to coordinates on the same chromosome or transcript: a position nearer the 5' end is smaller, and a position nearer the 3' end is larger.

In the present invention, the terms "upstream clipping" and "downstream clipping" describe an alignment in which one part of a sequence aligns fully to the reference genome sequence or the reference main transcript sequence and another part is clipped. If the clipped part lies upstream of the fully aligned part, it is upstream clipping; if it lies downstream of the fully aligned part, it is downstream clipping.
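As an illustration of the clipping definition, the following minimal sketch reads the clipping side from a SAM-style CIGAR string. It assumes that clipping is recorded as soft (`S`) or hard (`H`) clip operations at the ends of the CIGAR, as in standard BAM records; the function name `clip_sides` is hypothetical and not part of the described method.

```python
import re

# One CIGAR operation: a length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clip_sides(cigar):
    """Return (upstream_clipped, downstream_clipped) for a CIGAR string.

    CIGAR operations are listed in reference (forward-strand) order, so a
    leading clip lies upstream of the aligned part and a trailing clip lies
    downstream of it.
    """
    ops = _CIGAR_OP.findall(cigar)
    if not ops or "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    upstream = ops[0][1] in "SH"
    downstream = len(ops) > 1 and ops[-1][1] in "SH"
    return upstream, downstream
```

For example, `clip_sides("30S70M")` reports upstream clipping only, while `clip_sides("70M30S")` reports downstream clipping only.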

In the present invention, the terms "large-numbered chromosome" and "small-numbered chromosome" refer to the natural numbers to which chromosomes are mapped inside the BAM file: the chromosome mapped to the larger number is the large-numbered chromosome, and the one mapped to the smaller number is the small-numbered chromosome.

The technical content of the present invention is described in detail below with reference to the accompanying drawings and specific examples. Those skilled in the art will understand that the following examples only illustrate the present invention and should not be construed as limiting its scope.

Example 1

A total of 15,000 samples were collected from patients with solid tumors and patients with hematological tumors. DNA was extracted for capture-based library construction and paired-end next-generation sequencing, and RNA was also extracted from some samples for capture sequencing. For each sequenced library, DNA libraries were aligned to the human reference genome HG19 with the BWA alignment software, and RNA libraries were aligned with BWA to a custom-built transcriptome. The transcriptome was constructed as follows:

One transcript is selected per gene as the main transcript. A transcript included in the COSMIC and OncoKB databases is preferred; if none is included there, the transcript recorded in the UCSC main transcript database is used; if none of these three databases designates a main transcript, the longest transcript is used as the main transcript.
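The selection priority described above can be sketched as follows. This is a minimal illustration, not the implementation used in the example: the `Transcript` record, the function name, and the two identifier sets (standing in for the curated databases and the UCSC main transcript list) are hypothetical, and the tie-break by length inside a priority tier is an assumption, since the text does not say how ties are resolved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    transcript_id: str
    length: int


def pick_main_transcript(transcripts, curated_ids, ucsc_main_ids):
    """Pick one transcript for a gene.

    Priority: transcripts listed in the curated databases, then transcripts
    listed in the UCSC main transcript database, then the longest transcript.
    """
    if not transcripts:
        raise ValueError("gene has no transcripts")
    for preferred in (curated_ids, ucsc_main_ids):
        hits = [t for t in transcripts if t.transcript_id in preferred]
        if hits:
            # Assumption: ties within a tier are broken by length.
            return max(hits, key=lambda t: t.length)
    return max(transcripts, key=lambda t: t.length)
```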

PCR duplicates were removed from the alignment results and errors were corrected, producing the BAM file used for structural variation detection. This BAM file is the input file of the method of the present invention. The minimum seed count was set to 2 for discordantly aligned read pairs and to 1 for split-aligned reads, memory was set to 8 GB, the number of threads was set to 8, and the jobs were submitted to a compute cluster for analysis.

Example 2

The alignment result file is traversed and 1 million normally aligned reads are taken (a read qualifies if it has no split alignment and its read pair is a normal pair that supports no structural variation). Their insert sizes are tallied, and the main parameters of the insert size distribution are calculated: the maximum, the minimum, and the mean.
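The insert size statistics can be sketched as below. The sketch assumes the caller has already filtered the input down to normally aligned pairs and passes one signed template length per pair (BAM stores this as a signed value, hence the `abs`); the function name and the dictionary layout are hypothetical.

```python
from itertools import islice
from statistics import fmean


def insert_size_summary(template_lengths, sample_size=1_000_000):
    """Summarise insert sizes from the first `sample_size` normal pairs."""
    sizes = [abs(t) for t in islice(template_lengths, sample_size)]
    if not sizes:
        raise ValueError("no normally aligned pairs were supplied")
    return {"min": min(sizes), "max": max(sizes), "mean": fmean(sizes)}
```

The maximum returned here plays the role of the maximum insert size used later as the distance limit when discordant pairs are clustered and merged.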

The alignment result file is traversed a second time, visiting all reads, to find every split-aligned read and every abnormally aligned read pair. The two kinds of evidence are then classified separately. Split-aligned reads are classified mainly by how their two aligned parts align; the criteria include whether both parts align to the same chromosome, whether they align in different orientations on the genome, and whether the clipping positions are both upstream in the alignment records (see Table 1 for the detailed classification). Abnormally aligned read pairs are classified mainly by whether the two reads align to the same chromosome, whether they align in different orientations on the genome, and the insert size (see Table 2 for the detailed classification).

Table 1 Classification of split-aligned reads

Table 2 Classification of discordantly aligned read pairs

The split-aligned reads and the discordantly aligned read pairs of each category are clustered separately. The purpose of clustering is to place reads that support the same structural variation event into one set, so that a more accurate breakpoint is obtained and the random breakpoint error caused by alignment and sequencing errors is kept as small as possible.

Split-aligned reads are clustered as follows. The reads are sorted by structural variation type, the reference sequence names to which the two parts of the read align, the first breakpoint position, and the second breakpoint position. Reads with the same structural variation type, the same reference sequence names, and both breakpoint positions within 10 bases of each other are treated as one class. If the number of supporting reads in a class reaches the preset threshold (set to 1 in this example; because probe design and fragmentation during sequencing may prevent many breakpoint-spanning reads from being captured, the threshold may be as low as 1, which gives very high sensitivity), the breakpoint positions of all reads in the class are averaged to give the mean breakpoint position, producing one structural variation event supported by split-aligned reads. The largest and smallest deviations from the mean breakpoint are taken as the breakpoint error range.
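The clustering of split-aligned reads might look like the sketch below. It rests on several assumptions that the text leaves open: a read joins a class when both of its breakpoints lie within 10 bases of the first read of that class, a class is kept when its size is at least `min_support`, and the error range is reported as offsets from the mean. The `SplitRead` record and all names are hypothetical.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class SplitRead:
    sv_type: str
    ref1: str
    ref2: str
    bp1: int
    bp2: int


def cluster_split_reads(reads, max_dist=10, min_support=1):
    """Group split reads by type, references and nearby breakpoints."""
    ordered = sorted(reads, key=lambda r: (r.sv_type, r.ref1, r.ref2, r.bp1, r.bp2))
    clusters = []
    for read in ordered:
        for members in clusters:
            a = members[0]  # assumption: compare against the first member
            if ((a.sv_type, a.ref1, a.ref2) == (read.sv_type, read.ref1, read.ref2)
                    and abs(read.bp1 - a.bp1) <= max_dist
                    and abs(read.bp2 - a.bp2) <= max_dist):
                members.append(read)
                break
        else:
            clusters.append([read])

    events = []
    for members in clusters:
        if len(members) < min_support:
            continue
        bp1s = [r.bp1 for r in members]
        bp2s = [r.bp2 for r in members]
        m1, m2 = fmean(bp1s), fmean(bp2s)
        events.append({
            "sv_type": members[0].sv_type,
            "ref1": members[0].ref1,
            "ref2": members[0].ref2,
            "bp1": m1,
            "bp2": m2,
            "support": len(members),
            # Smallest and largest deviation from the mean breakpoint.
            "bp1_error": (min(bp1s) - m1, max(bp1s) - m1),
            "bp2_error": (min(bp2s) - m2, max(bp2s) - m2),
        })
    return events
```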

Discordantly aligned read pairs are clustered as follows. The pairs are sorted by structural variation type, the reference sequence name to which the read aligns, the reference sequence name to which its mate aligns, the alignment position of the read, and the alignment position of the mate. Pairs with the same reference sequence names whose alignment positions differ by no more than the maximum insert size are treated as one class. If the number of supporting pairs in a class reaches the preset threshold (set to 2 in this example; the value can be adjusted to the actual sequencing quality, with a minimum of 1 giving maximum sensitivity; 2 is recommended because the total number of discordant pairs usually cannot end up higher than the value seen in the initial evidence-collection stage, and too low a setting creates a large computational load), the breakpoint range determined by the pairs in the class is used to estimate the breakpoint position, producing one structural variation event supported by discordantly aligned read pairs.
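A corresponding sketch for discordant pairs is given below. As before, it is only an illustration under stated assumptions: distance is measured against the first pair of each class, both the read and mate positions must fall within the maximum insert size, a class is kept when it has at least `min_support` pairs, and the breakpoint estimate is reported simply as the range of positions in the class. The `DiscordantPair` record and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiscordantPair:
    sv_type: str
    ref: str
    mate_ref: str
    pos: int
    mate_pos: int


def cluster_discordant_pairs(pairs, max_insert, min_support=2):
    """Group discordant pairs whose positions differ by at most max_insert."""
    ordered = sorted(
        pairs, key=lambda p: (p.sv_type, p.ref, p.mate_ref, p.pos, p.mate_pos)
    )
    clusters = []
    for pair in ordered:
        for members in clusters:
            a = members[0]  # assumption: compare against the first member
            if ((a.sv_type, a.ref, a.mate_ref) == (pair.sv_type, pair.ref, pair.mate_ref)
                    and abs(pair.pos - a.pos) <= max_insert
                    and abs(pair.mate_pos - a.mate_pos) <= max_insert):
                members.append(pair)
                break
        else:
            clusters.append([pair])

    events = []
    for members in clusters:
        if len(members) < min_support:
            continue
        events.append({
            "sv_type": members[0].sv_type,
            "ref": members[0].ref,
            "mate_ref": members[0].mate_ref,
            "support": len(members),
            # The breakpoint is only known to lie near these ranges.
            "pos_range": (min(p.pos for p in members), max(p.pos for p in members)),
            "mate_pos_range": (
                min(p.mate_pos for p in members),
                max(p.mate_pos for p in members),
            ),
        })
    return events
```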

For a structural variation event determined by split-aligned reads, the reads supporting the event are assembled into one conserved sequence. Assembly is carried out by multiple sequence alignment. The conserved sequence reduces the influence on breakpoint determination of small insertions, deletions, or single-nucleotide changes caused by sequencing errors. During assembly, the participating sequences are parsed and reconstructed, which mainly involves two operations: extracting short inserted sequences near the breakpoint, because an inserted sequence interferes with alignment to the reference sequence and with precise breakpoint determination; and adjusting sequence orientation so that, as far as possible, the 5' end of each read has the same orientation as the reference sequence, which simplifies uniform processing and breakpoint calculation later.
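The consensus step can be illustrated with the following sketch, which applies the definition of "conserved sequence" to rows that have already been aligned (equal length, with `-` for gaps). Producing the multiple sequence alignment itself is out of scope here. The default cutoff of "more than three" supporting sequences comes from the term definition; how that cutoff interacts with events supported by very few reads is not stated in the text, so `min_support` is exposed as a parameter. The function name is hypothetical.

```python
from collections import Counter


def conserved_sequence(aligned_rows, min_support=3):
    """Build a consensus from pre-aligned rows of equal length.

    At each column the most frequent base is kept only if it is supported
    by more than `min_support` rows; gap characters never count as support.
    """
    if len({len(row) for row in aligned_rows}) > 1:
        raise ValueError("aligned rows must all have the same length")
    consensus = []
    for column in zip(*aligned_rows):
        counts = Counter(base for base in column if base != "-")
        if not counts:
            continue
        base, support = counts.most_common(1)[0]
        if support > min_support:
            consensus.append(base)
    return "".join(consensus)
```

In the test rows used with this sketch, a single-read insertion and a single-read substitution are both voted out, which is the error-damping effect the paragraph describes.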

The conserved sequence is then aligned to the reference sequence, and the exact breakpoint position is determined from where the split alignment occurs. If no suitable split alignment is produced, the structural variation event is not reliable and is discarded without further analysis; such cases are usually false positives caused by sequencing errors or numerous alignment errors. Once the breakpoint is obtained, a short sequence spanning 10 bases upstream and 10 bases downstream of the breakpoint is taken as a probe. When the frequency of the structural variation event is calculated later, the original BAM file is traversed again to extract more breakpoint-spanning reads. These may be split-aligned reads already used as seed reads for calling the event, or reads whose clipped portion was too short to be split-aligned when the BAM file was generated. Comparing such reads with the probe shows whether they support the corresponding structural variation event, which gives a more accurate frequency. Where an inserted sequence is involved, two probes are kept: one containing the inserted sequence and one without it.
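Probe construction and probe-based read rescue can be sketched as follows. The sketch simplifies the comparison of a read with a probe to an exact substring test, which is an assumption (the text only says reads are compared with the probe), and both function names are hypothetical.

```python
def junction_probes(upstream, downstream, inserted="", flank=10):
    """Build probes from `flank` bases on each side of the breakpoint.

    `upstream` ends at the breakpoint and `downstream` starts at it. When an
    inserted sequence is present, a second probe containing it is also kept.
    """
    if flank <= 0:
        raise ValueError("flank must be positive")
    if len(upstream) < flank or len(downstream) < flank:
        raise ValueError("flanking sequences are shorter than the flank size")
    left, right = upstream[-flank:], downstream[:flank]
    probes = [left + right]
    if inserted:
        probes.append(left + inserted + right)
    return probes


def supports_event(read_seq, probes):
    """Assumption: a read supports the event if it contains a probe exactly."""
    return any(probe in read_seq for probe in probes)
```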

The reference sequence consists of two parts: one is taken from the reference sequence spanning one breakpoint of the structural variation event, and the other from the reference sequence spanning the other breakpoint. How the two parts are joined (which part is at the 5' end, which is at the 3' end, and whether each part is joined as the forward sequence or as its reverse complement) is determined by the alignment of the split-aligned reads that support the event.

The specific rules are as follows. For a type I inversion, the upstream part comes from the reference sequence spanning the breakpoint with the smaller coordinate and the downstream part from the reference sequence spanning the breakpoint with the larger coordinate; the upstream part is joined in the forward orientation and the downstream part as its reverse complement. For a type II inversion, the upstream part comes from the reference sequence spanning the breakpoint with the smaller coordinate and the downstream part from the reference sequence spanning the breakpoint with the larger coordinate; the upstream part is joined as its reverse complement and the downstream part in the forward orientation. For a deletion, the upstream part comes from the reference sequence spanning the breakpoint with the smaller coordinate and the downstream part from the reference sequence spanning the breakpoint with the larger coordinate; both parts are joined in the forward orientation. For a duplication, the upstream part comes from the reference sequence spanning the breakpoint with the larger coordinate and the downstream part from the reference sequence spanning the breakpoint with the smaller coordinate; both parts are joined in the forward orientation. For an insertion, the reference sequence is taken directly from the forward strand of a reference sequence spanning the insertion site. For an event joining the 5' ends of two chromosomes, the upstream part comes from the reference sequence spanning the breakpoint on the larger-numbered chromosome and the downstream part from the reference sequence spanning the breakpoint on the smaller-numbered chromosome; the upstream part is joined in the forward orientation and the downstream part as its reverse complement. For an event joining the 3' ends of two chromosomes, the upstream part comes from the reference sequence spanning the breakpoint on the larger-numbered chromosome and the downstream part from the reference sequence spanning the breakpoint on the smaller-numbered chromosome; the upstream part is joined as its reverse complement and the downstream part in the forward orientation. For an event joining the 5' end of the large-numbered chromosome to the 3' end of the small-numbered chromosome, the upstream part comes from the reference sequence spanning the breakpoint on the larger-numbered chromosome and the downstream part from the reference sequence spanning the breakpoint on the smaller-numbered chromosome; both parts are joined in the forward orientation. For an event joining the 3' end of the large-numbered chromosome to the 5' end of the small-numbered chromosome, the upstream part comes from the reference sequence spanning the breakpoint on the larger-numbered chromosome and the downstream part from the reference sequence spanning the breakpoint on the smaller-numbered chromosome; both parts are joined in the forward orientation.
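The joining rules above lend themselves to a lookup table. The sketch below encodes them as stated in the text; the event-type keys, the "small"/"large" labels, and the function names are hypothetical. "small" and "large" mean the smaller and larger breakpoint coordinate for intra-chromosomal events, and the smaller- and larger-numbered chromosome for inter-chromosomal events. Insertions are omitted because their reference is taken directly from the forward strand across the insertion site.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


# event type -> (upstream source, upstream reverse-complemented?,
#                downstream source, downstream reverse-complemented?)
JOIN_RULES = {
    "inversion_type_1": ("small", False, "large", True),
    "inversion_type_2": ("small", True, "large", False),
    "deletion": ("small", False, "large", False),
    "duplication": ("large", False, "small", False),
    "interchrom_5p_5p": ("large", False, "small", True),
    "interchrom_3p_3p": ("large", True, "small", False),
    "interchrom_large5p_small3p": ("large", False, "small", False),
    "interchrom_large3p_small5p": ("large", False, "small", False),
}


def build_junction_reference(event_type, small_side, large_side):
    """Join the two breakpoint-spanning reference pieces for one event."""
    up_src, up_rc, down_src, down_rc = JOIN_RULES[event_type]
    pieces = {"small": small_side, "large": large_side}
    upstream, downstream = pieces[up_src], pieces[down_src]
    if up_rc:
        upstream = reverse_complement(upstream)
    if down_rc:
        downstream = reverse_complement(downstream)
    return upstream + downstream
```

Keeping the rules in data rather than in branching code makes it easy to check each row against the prose one case at a time.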

After the breakpoint positions have been refined, some structural variation events have nearby breakpoints and the same type. They actually arise from the same structural variation, and two events were produced only because of sequencing or alignment errors. In this case the events are merged and the one with the higher support is retained. Structural variations supported by split-aligned reads are merged if the distance between their breakpoints is within the homologous sequence length of either event or is less than 10 bases. Structural variations supported by discordantly aligned read pairs are merged if they are of the same type and the distance between their breakpoints is within the maximum insert length; the one with the higher support is retained.

Structural variation events determined by split-aligned reads are often also supported by discordantly aligned read pairs. In the earlier calculations of this example the two kinds of event are kept separate; they are merged at this point to strengthen the evidence. The merging criterion is that the breakpoint of an event supported by split-aligned reads lies within the maximum insert length of the breakpoint of an event of the same type supported by discordantly aligned read pairs. Events supported by discordantly aligned read pairs that cannot be merged are retained as they are.

Among the structural variation events supported by split-aligned reads, some false positives come from reads in repetitive regions. It is not clear which position of the genome such reads come from, and their conserved sequences can often be aligned to several positions in the genome. In addition, the earlier breakpoint determination used an alignment over a small range, which limits its accuracy. To refine the breakpoint positions further, the conserved sequence is aligned globally against the reference genome, and structural variation events in repetitive regions are flagged.

This alignment uses a BWT-based algorithm. If the conserved sequence matches a stretch of the genome completely and continuously, or if it produces multiple consistent alignments, the structural variation event is not a true structural variation event. The procedure is as follows. If the conserved sequence matches the reference genome sequence or the reference main transcript sequence completely and continuously, the group of records is logged and filtered out. If the conserved sequence is split-aligned to the reference genome sequence or the reference main transcript sequence, the breakpoint position determined by a split alignment is within 10 bp of the previously calculated breakpoint position, and there are more than 2 such split alignments, the group of records is logged and filtered out. If there is exactly 1 such split alignment, the breakpoint is refined further from its position. If there are fewer than 1 such split alignments, the event is neither flagged nor modified.
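The decision rules of this step can be summarised in a small function. This is a sketch with hypothetical names and return labels. The text specifies the outcome for more than two, exactly one, and no matching split alignments; the case of exactly two is not stated, and the sketch leaves such events untouched, which is an assumption.

```python
def decide_after_global_alignment(full_match, candidate_breakpoints,
                                  prior_breakpoint, tolerance=10):
    """Return "filter", "refine" or "keep" for one event.

    `full_match` is True when the conserved sequence aligns completely and
    continuously. `candidate_breakpoints` holds the breakpoint implied by
    each split alignment of the conserved sequence.
    """
    if full_match:
        return "filter"
    consistent = sum(
        1 for bp in candidate_breakpoints
        if abs(bp - prior_breakpoint) <= tolerance
    )
    if consistent > 2:
        return "filter"  # ambiguous placement, likely a repeat region
    if consistent == 1:
        return "refine"  # use this alignment to sharpen the breakpoint
    # consistent == 0 per the text; consistent == 2 is an assumption here.
    return "keep"
```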

Once the structural variation events with further refined breakpoints are obtained, their frequencies are calculated. Calculating the frequency of a structural variation event requires two quantities: the reference-type count at the two breakpoints of the event, and the variant-type count of the event. To avoid the inflated counts that result from the conventional practice of using the number of reads as the variant-type count, this example counts by molecule. The reference type is also counted by molecule, and the higher of the reference-type counts on the two sides of the breakpoint is used as the reference-type count of the event.

Whether a read or read pair spanning a breakpoint supports the reference type is determined from the classification of reads for the different structural variations (see Table 1 and Table 2); reads and read pairs that support no variant type are counted as reference type.

The structural variation frequency equals the number of molecules supporting the structural variation divided by the sum of the number of molecules supporting the structural variation and the number of molecules supporting the reference type.
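The frequency calculation can be written out as a short sketch. Function and parameter names are hypothetical; molecules are represented by template identifiers so that several reads from the same template count once, and raising an error when no molecule covers the event is a choice made for this sketch.

```python
def count_molecules(template_ids):
    """Molecular counting: reads sharing a template are counted once."""
    return len(set(template_ids))


def sv_frequency(variant_molecules, ref_molecules_bp1, ref_molecules_bp2):
    """variant / (variant + reference), using the higher reference count."""
    reference = max(ref_molecules_bp1, ref_molecules_bp2)
    total = variant_molecules + reference
    if total == 0:
        raise ValueError("no molecules cover this event")
    return variant_molecules / total
```

For instance, with 1 variant molecule and 3 and 2 reference molecules at the two breakpoints, the reference count is 3 and the frequency is 1 / (1 + 3) = 0.25.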

Example 3

The structural variation detection method of the present invention supports identification of both RNA and DNA structural variation, whereas few prior-art methods support both. For example, DELLY and GeneFuse support only DNA structural variation identification, and TophatFusion supports only RNA structural variation identification. To test the performance of the method of the present invention, it was compared with FusionMap, a commercial software package that supports both DNA and RNA structural variation identification.

From the sequencing data of Example 1, 269 positive samples were extracted, each carrying 1-2 fusions confirmed experimentally. Structural variation was detected on the DNA library sequencing data with the method of Example 2 and with FusionMap. The method of Example 2 achieved a 100% detection rate (see Table 3) and reported very few fusions per sample, which supports its high specificity. FusionMap also detected 100% of the confirmed fusions, but it additionally reported other unreliable fusions (on average more than 3 fusions per sample). Compared with FusionMap, the method of the present invention therefore has higher specificity.

Table 3 Analysis results for positive samples with the detection method of the present invention

Number of test samples: 269
Detection rate: 100%
Fusions reported per sample: fewer than 2 on average

A further test used 18 difficult samples, comprising samples with poor raw data quality and samples with complex libraries. All 18 were identified by the method of Example 2, whereas none was identified by FusionMap (see Table 4). Compared with the paid commercial software FusionMap, the method of the present invention therefore detects samples with poor raw data quality or complex libraries better.

Table 4 Analysis results for complex or low-frequency positive samples

Number of test samples: 18
Samples detected by the method of Example 2: 18 (100% detected)
Samples detected by FusionMap: 0

The test scale was then expanded: 8,000 samples were extracted from the sequencing data of Example 1. They were first analyzed with FusionMap, which missed credible fusions in 98 samples. These 98 samples were then analyzed with the method of Example 2, and all of them were detected (Table 5).

Table 5 Analysis results for clinical samples

Number of test samples: 8,000
Samples with credible fusions missed by FusionMap: 98
Missed samples detected by the method of Example 2: 98

In addition, the high sensitivity of the method of the present invention can be attributed at least in part to its principle: when searching for initial evidence of a possible structural variation event, it can proceed with support from only one split-aligned read or only one discordantly aligned read pair. By contrast, FusionMap, used here as the control, requires at least 2 split-aligned reads to identify a structural variation, and DELLY, another widely used open-source tool, requires at least 3 split-aligned reads or 3 discordantly aligned read pairs before it starts structural variation calculation. The detection method of the present invention therefore has higher sensitivity than conventional methods.

The tests above already show in part the high specificity of the method of the present invention. The test scale was expanded further to all 15,000 samples of Example 1. The method of Example 2 yielded 1-2 credible structural variations per sample, whereas FusionMap produced a large number of results with more than 5 structural variations, and DELLY produced some results with dozens or even hundreds of structural variations. In general, 1-2 structural variations per sample are credible, while additional calls tend to be unreliable or even clinically meaningless. This result indicates that the method of the present invention has higher specificity.

To test the speed and resource consumption of the method of the present invention, the method of Example 2 was run on the 15,000 samples of Example 1. The average analysis time was 3 minutes per sample (see Table 6), with an average memory use of 6 GB per sample and 8 threads per sample. Typical open-source or commercial software needs at least half an hour, and consumes more resources. Compared with the prior art, the method of the present invention is therefore faster and consumes fewer resources.

Table 6 Resource consumption analysis

Test samples: 15,000
Average memory consumption: 6 GB per sample
Number of threads: 8 per sample
Average analysis time: 3 minutes per sample

Finally, it should be noted that the above examples only illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing examples, those of ordinary skill in the art should understand that the technical solutions described in those examples can still be modified, and that some or all of their technical features can be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the examples of the present invention.

Claims

1. A method for detecting structural variation, the method comprising:

1) aligning the sequencing data to a reference genome sequence or a reference main transcript sequence;

2) finding normally aligned reads, split-aligned reads, and discordantly aligned read pairs;

3) classifying the split-aligned reads and the discordantly aligned read pairs;

4) clustering the split-aligned reads and the discordantly aligned read pairs of each category separately, so that reads supporting the same structural variation event are placed in the same set;

5) for structural variation events determined by split-aligned reads, forming a conserved sequence by assembling the reads that support the structural variation event;

6) determining an exact breakpoint location based on the conserved sequence;

7) merging, respectively, the structural variations supported by split-aligned reads and the structural variations supported by discordantly aligned read pairs, wherein merging means combining structural variation events with similar breakpoints and the same type into a single structural variation event;

8) merging structural variations supported by split-aligned reads with structural variations supported by discordantly aligned read pairs that have similar breakpoints and the same type;

9) removing structural variation events whose conserved sequence can be completely and continuously matched to a segment of the genome or can produce multiple consistent alignments;

10) calculating the frequency of structural variation events.

2. The detection method according to claim 1, wherein step 2) further comprises tallying insert sizes from the normally aligned reads and calculating the main parameters of the insert size distribution; the main parameters are preferably the maximum, the minimum and/or the mean.

3. The detection method according to any one of claims 1-2, wherein SA tags are used to find the split-aligned reads and/or the discordantly aligned read pairs.
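
For illustration only, an SA tag value can be parsed as sketched below. The layout used, semicolon-terminated entries of `rname,pos,strand,CIGAR,mapQ,NM`, follows the SAM optional-field definition of the SA tag; the names `SupplementaryAlignment` and `parse_sa_tag` are hypothetical, and the function expects the value without the leading `SA:Z:` prefix.

```python
from typing import List, NamedTuple


class SupplementaryAlignment(NamedTuple):
    rname: str
    pos: int       # 1-based leftmost position, as in SAM
    strand: str    # "+" or "-"
    cigar: str
    mapq: int
    nm: int        # edit distance


def parse_sa_tag(sa_value: str) -> List[SupplementaryAlignment]:
    """Parse the value of an SA tag into one record per supplementary alignment."""
    alignments = []
    for entry in sa_value.split(";"):
        if not entry:  # the value ends with ';', leaving an empty last piece
            continue
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        alignments.append(
            SupplementaryAlignment(rname, int(pos), strand, cigar, int(mapq), int(nm)))
    return alignments


parts = parse_sa_tag("chr2,29446394,+,60S40M,60,0;chr9,100,-,30M70S,20,1;")
```

A read that carries at least one such entry has part of its sequence aligned elsewhere, which is what makes it a candidate for a split alignment.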

4. The method according to any one of claims 1 to 3, wherein in step 8), the merging condition comprises that the breakpoint of a structural variation event supported by split-aligned reads lies within the maximum insert length of the breakpoint of a structural variation event of the same type supported by discordantly aligned read pairs.
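
The merging condition of claim 4 might be expressed as in the following sketch. The `Event` record and the function `can_merge` are hypothetical, and requiring both breakpoints to lie on the same chromosomes and within the maximum insert length is an assumption; the claim itself states only the same-type and distance conditions.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """A structural variation call with two breakpoints (hypothetical record)."""
    sv_type: str
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int


def can_merge(split_event: Event, pair_event: Event, max_insert: int) -> bool:
    """Decide whether a split-read event and a read-pair event describe the same call."""
    if split_event.sv_type != pair_event.sv_type:
        return False
    if (split_event.chrom1, split_event.chrom2) != (pair_event.chrom1, pair_event.chrom2):
        return False
    # Assumption: both breakpoints must be within the maximum insert length.
    return (abs(split_event.pos1 - pair_event.pos1) <= max_insert
            and abs(split_event.pos2 - pair_event.pos2) <= max_insert)


a = Event("DEL", "chr1", 1000, "chr1", 5000)
b = Event("DEL", "chr1", 1200, "chr1", 4800)
```

With these two events, `can_merge(a, b, 500)` holds and `can_merge(a, b, 100)` does not, since the breakpoints are 200 bp apart.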

5. The detection method according to any one of claims 1 to 4, wherein in step 9), if the consensus sequence can be matched completely and contiguously to a segment of the genome, the records of the group are marked and/or filtered; if the consensus sequence can be split-aligned to the reference genome sequence or the reference major transcript sequence, the breakpoint position determined by a split alignment is within 10 bp of the previously calculated breakpoint position, and there are more than 2 groups of such split alignments, the records of the group are marked and/or filtered; if there is 1 group of such split alignments, the breakpoint is further refined according to that breakpoint position; if there is less than 1 group of such split alignments, no marking and/or modification is performed.
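
The decision rule of claim 5 can be read as the following sketch. The function name, its inputs, and the returned labels (`"FILTER"`, `"REFINE"`, `"KEEP"`, `"UNSPECIFIED"`) are assumptions made for illustration, with breakpoints reduced to a single coordinate. The case of exactly two matching groups is not covered by the claim as worded here, so the sketch reports it separately instead of guessing.

```python
from typing import Sequence


def review_assembled_sequence(matches_contiguously: bool,
                              realigned_breakpoints: Sequence[int],
                              called_breakpoint: int,
                              window: int = 10) -> str:
    """Return an action label for one group of records.

    realigned_breakpoints: breakpoint positions obtained by re-aligning the
    assembled sequence, one per split alignment group.
    """
    if matches_contiguously:
        return "FILTER"  # the sequence matches one contiguous reference segment
    nearby = [b for b in realigned_breakpoints
              if abs(b - called_breakpoint) <= window]
    if len(nearby) > 2:
        return "FILTER"  # more than 2 groups agree with the called position
    if len(nearby) == 1:
        return "REFINE"  # use this position to refine the breakpoint
    if len(nearby) < 1:
        return "KEEP"    # nothing to mark or modify
    return "UNSPECIFIED"  # exactly 2 groups: not addressed by the claim text


action = review_assembled_sequence(False, [1003], 1000)
```

Here a single re-aligned breakpoint 3 bp from the called position yields `"REFINE"`.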

6. The method according to any one of claims 1 to 5, wherein in step 10) the structural variation event frequency is calculated by counting the reference-type support and the variant-type support at the two breakpoints of the structural variation event, wherein the reference type corresponds to the reference genome sequence or the reference major transcript sequence, and the variant type corresponds to the sequence of the structural variation event; preferably, the higher of the reference-type support counts at the two breakpoints is used as the reference-type count of the structural variation event.

7. The detection method according to claim 6, wherein the structural variation event frequency = number of molecules supporting the structural variation / (number of molecules supporting the structural variation + number of molecules supporting the reference type).
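
A worked instance of the frequency in claims 6 and 7 is sketched below. The function name `sv_frequency` and the handling of a zero denominator are assumptions; taking the larger of the two reference-type counts follows the preferred option of claim 6.

```python
def sv_frequency(variant_count: int, ref_count_bp1: int, ref_count_bp2: int) -> float:
    """Frequency = variant / (variant + reference).

    The reference count is the larger of the reference-type support counts
    observed at the two breakpoints.
    """
    ref_count = max(ref_count_bp1, ref_count_bp2)
    total = variant_count + ref_count
    if total == 0:
        return 0.0  # assumption: report 0 when no molecule covers the breakpoints
    return variant_count / total


freq = sv_frequency(5, 95, 80)
```

With 5 molecules supporting the variant and 95 and 80 molecules supporting the reference at the two breakpoints, the reference count is 95 and the frequency is 5 / (5 + 95) = 0.05.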

8. A system or apparatus for detecting structural variations, the system or apparatus comprising:

1) a sequencing module and/or a sequencing data acquisition module; and

2) an identification module for performing a method of detecting a structural variation according to any one of claims 1-7 on sequencing data.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program containing a program for executing the method for detecting a structural variation according to any one of claims 1 to 7.

10. An apparatus comprising a processor, a memory, and a computer program stored in the memory, the computer program comprising a program for performing the method of detecting a structural variation according to any one of claims 1-7.