JPWO2016114009A1

JPWO2016114009A1 - Fusion gene analysis apparatus, fusion gene analysis method, and program

Info

Publication number: JPWO2016114009A1
Application number: JP2016569243A
Authority: JP
Inventors: 一哉土原; 慎吾松本; 幸代三牧
Original assignee: National Cancer Center Japan
Current assignee: National Cancer Center Japan
Priority date: 2015-01-16
Filing date: 2015-11-24
Publication date: 2017-11-02
Anticipated expiration: 2035-11-24
Also published as: JP6691871B2; WO2016114009A1

Abstract

The apparatus includes: a read sequence acquisition unit that acquires read sequences output from a sequencer; a virtual complementary sequence generation unit that creates a complementary sequence for every acquired read sequence and outputs it as a virtual complementary sequence; a mapping information acquisition unit that supplies the read sequences and the virtual complementary sequences to a mapping device and acquires the result of mapping onto a reference sequence by the mapping device; a candidate read sequence extraction unit that extracts, as candidate read sequences, those read sequences for which, in the mapping result, the read sequence and its corresponding virtual complementary sequence are each split and mapped to two locations on the reference sequence, and that takes the split points of the extracted candidate read sequences as breakpoint candidates; a group creation unit that collects candidate read sequences whose breakpoint candidates lie within a predetermined number of bases of one another into one group; and a fusion gene determination information generation unit that generates, based on the characteristics and number of the candidate read sequences constituting each group, information for determining whether the candidate read sequences included in each group are derived from a fusion gene.

Description

The present invention relates to a fusion gene analysis apparatus, a fusion gene analysis method, and a program.

In recent years, fusion genes have attracted attention in cancer treatment. For example, Patent Document 1 discloses a system that analyzes a patient's genome data obtained from a sequencing machine using data sources distributed over a network, and provides information on the positions of mutations associated with cancer and other conditions and on the diseases that arise as a result of those mutations.

Patent Document 1: JP 2014-146318 A

However, because the method described in Patent Document 1 analyzes the genome data obtained from the sequencing machine without narrowing it down, the analysis takes a long time. In addition, because the method includes no processing to eliminate analysis errors arising at the sequencing or alignment stage, the accuracy of the analysis is insufficient. Furthermore, Patent Document 1 does not adequately describe the detection and extraction of fusion genes.

An object of the present invention is therefore to improve the accuracy of fusion gene analysis and to shorten the time it requires.

The fusion gene analysis system according to the present invention comprises: a read sequence acquisition unit that acquires read sequences output from a sequencer; a virtual complementary sequence generation unit that creates a complementary sequence for every acquired read sequence and outputs it as a virtual complementary sequence; a mapping information acquisition unit that supplies the read sequences and the virtual complementary sequences to a mapping device and acquires the result of mapping onto a reference sequence by the mapping device; a candidate read sequence extraction unit that extracts, as candidate read sequences, those read sequences for which, in the mapping result, the read sequence and its corresponding virtual complementary sequence are each split and mapped to two locations on the reference sequence, and that takes the split points of the extracted candidate read sequences as breakpoint candidates; a group creation unit that collects candidate read sequences whose breakpoint candidates lie within a predetermined number of bases of one another into one group; and a fusion gene determination information generation unit that generates, based on the characteristics and number of the candidate read sequences constituting each group, information for determining whether the candidate read sequences included in each group are derived from a fusion gene.

The candidate read sequence extraction unit may extract, as candidate read sequences, those for which the corresponding fragments of the read sequence and the virtual complementary sequence mapped to the two locations are mapped on the same chromosome.

The candidate read sequence extraction unit may also extract, as candidate read sequences, those for which each fragment of the read sequence and the virtual complementary sequence mapped to the two locations has a length equal to or greater than a predetermined number of bases.

The fusion gene determination information generation unit may rank each group according to how likely it is that the candidate read sequences included in the group are derived from a fusion gene.

The fusion gene determination information generation unit may set a higher rank for a group made up of a larger number of candidate read sequences.
The fusion gene determination information generation unit may also set the rank of a group low when, for one split point of the candidate read sequences constituting the group, a certain number or more of different partner split points exist.

The fusion gene analysis method according to the present invention includes the steps of: acquiring read sequences output from a sequencer; creating a complementary sequence for every acquired read sequence and outputting it as a virtual complementary sequence; supplying the read sequences and the virtual complementary sequences to a mapping device and acquiring the result of mapping onto a reference sequence by the mapping device; extracting, as candidate read sequences, those read sequences for which, in the mapping result, the read sequence and its corresponding virtual complementary sequence are each split and mapped to two locations on the reference sequence, and taking the split points of the extracted candidate read sequences as breakpoint candidates; collecting candidate read sequences whose breakpoint candidates lie within a predetermined number of bases of one another into one group; and generating, based on the characteristics and number of the candidate read sequences constituting each group, information for determining whether the candidate read sequences included in each group are derived from a fusion gene.

The program according to the present invention causes a computer to function as: a read sequence acquisition unit that acquires read sequences output from a sequencer; a virtual complementary sequence generation unit that creates a complementary sequence for every acquired read sequence and outputs it as a virtual complementary sequence; a mapping information acquisition unit that supplies the read sequences and the virtual complementary sequences to a mapping device and acquires the result of mapping onto a reference sequence by the mapping device; a candidate read sequence extraction unit that extracts, as candidate read sequences, those read sequences for which, in the mapping result, the read sequence and its corresponding virtual complementary sequence are each split and mapped to two locations on the reference sequence, and that takes the split points of the extracted candidate read sequences as breakpoint candidates; a group creation unit that collects candidate read sequences whose breakpoint candidates lie within a predetermined number of bases of one another into one group; and a fusion gene determination information generation unit that generates, based on the characteristics and number of the candidate read sequences constituting each group, information for determining whether the candidate read sequences included in each group are derived from a fusion gene.

According to the present invention, the accuracy of fusion gene analysis can be improved and the time it requires can be shortened.

FIG. 1 is a diagram showing an outline of a fusion gene analysis system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of a fusion gene analysis apparatus according to the embodiment of the present invention.
FIG. 3 is a diagram explaining a mapping result according to the embodiment of the present invention.
FIG. 4 is a flowchart of the operation of the fusion gene analysis system according to the embodiment of the present invention.
FIG. 5 is a diagram showing the results of analysis by the fusion gene analysis system according to the embodiment of the present invention.

Next, an embodiment for carrying out the present invention will be described in detail with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a fusion gene analysis system 10 according to an embodiment of the present invention. As shown in the figure, the fusion gene analysis system 10 includes a fusion gene analysis apparatus 100, a DNA sequencer 200, and a gene mapping apparatus 300. The fusion gene analysis apparatus 100, the DNA sequencer 200, and the gene mapping apparatus 300 are connected via a communication line 50.

FIG. 2 is a block diagram showing the configuration of the fusion gene analysis apparatus 100. As shown in the figure, the fusion gene analysis apparatus 100 includes a read sequence acquisition unit 101, a virtual complementary sequence generation unit 102, a mapping information acquisition unit 103, a candidate read sequence extraction unit 104, a group creation unit 105, a fusion gene determination information generation unit 106, a display device 107, and an input device 108.

A dedicated or general-purpose computer can be used as the fusion gene analysis apparatus 100, provided that it includes a CPU, memory such as ROM and RAM, an external storage device that stores various kinds of information, an input interface, an output interface, a communication interface, and a bus connecting these components. The fusion gene analysis apparatus 100 may consist of a single computer or of a plurality of computers connected to one another via a communication line.

The read sequence acquisition unit 101, the virtual complementary sequence generation unit 102, the mapping information acquisition unit 103, the candidate read sequence extraction unit 104, the group creation unit 105, and the fusion gene determination information generation unit 106 correspond to functional modules realized when the CPU executes a predetermined program stored in the ROM or the like.

The display device 107 is a display or similar device that receives image signals output from the CPU of the fusion gene analysis apparatus 100 and displays various images.
The input device 108 comprises various devices, including a mouse and a keyboard, that the user uses to enter information into the fusion gene analysis apparatus 100.

The read sequence acquisition unit 101 acquires the read sequences output from the DNA sequencer 200. The DNA sequencer 200 may be of either the single-end read type or the paired-end read type. Genomic DNA is used as the sample for base sequence analysis. Analysis efficiency can be further increased by using a target capture sample in which only the base sequences of specific regions have been amplified.

The virtual complementary sequence generation unit 102 creates a complementary sequence for every acquired read sequence and outputs it as a virtual complementary sequence. Specifically, the virtual complementary sequence generation unit 102 takes as input a read sequence consisting of the bases A (adenine), T (thymine), G (guanine), and C (cytosine), converts each base to its complementary base (A→T, T→A, G→C, C→G), and outputs the result arranged in reverse order as the virtual complementary sequence.
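
As a non-limiting illustration, the conversion performed by the virtual complementary sequence generation unit 102 might be sketched in Python as follows. The function name `virtual_complement`, the upper-casing of the input, and the pass-through of the ambiguous base N are assumptions of this sketch and are not taken from the embodiment.

```python
# Translation table for A<->T and G<->C. Keeping N unchanged is an assumption
# of this sketch; the embodiment only mentions A, T, G and C.
_COMPLEMENT = str.maketrans("ATGCN", "TACGN")


def virtual_complement(read: str) -> str:
    """Return the virtual complementary sequence of a read.

    Each base is replaced by its complementary base and the result is
    reversed, as described for the virtual complementary sequence
    generation unit 102.
    """
    return read.upper().translate(_COMPLEMENT)[::-1]


def with_virtual_complements(reads):
    """Pair every read with its virtual complementary sequence."""
    return [(read, virtual_complement(read)) for read in reads]
```

Applying the function twice returns the original read, which gives a simple consistency check on the output.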

As a result, even when a single-end read DNA sequencer 200 is used, mapping can be performed using pairs of a read sequence and its complementary sequence, as in the paired-end read method. Moreover, because this embodiment creates a virtual complementary sequence for every acquired read sequence, it obtains the complementary sequence of the entire read sequence, rather than a complementary sequence covering only a limited range as in the ordinary paired-end read method.

The mapping information acquisition unit 103 supplies the read sequences and the virtual complementary sequences to the gene mapping apparatus 300 and acquires the result of mapping onto the reference sequence by the gene mapping apparatus 300. The gene mapping apparatus 300 maps the read sequences and the virtual complementary sequences onto the reference sequence using, for example, the BWA (Burrows-Wheeler Aligner)-SW (Smith-Waterman) algorithm.

The candidate read sequence extraction unit 104 extracts, as candidate read sequences, those read sequences for which, in the mapping result, the read sequence and its corresponding virtual complementary sequence are each split and mapped to two locations on the reference sequence, and takes the split points of the extracted candidate read sequences as breakpoint candidates.

A fusion gene is formed when genes that were originally apart become fused through an event such as a chromosomal translocation, in which base sequences exchange positions between chromosomes; an interstitial deletion, in which part of the base sequence within a chromosome is lost; or a chromosomal inversion, in which the position of a base sequence is rearranged within the same chromosome.

When the read sequences include one derived from a fusion gene, that read sequence is split in two at the fusion site (breakpoint) in the mapping result, and the two fragments are mapped to different positions on the reference sequence. The virtual complementary sequence is likewise split at the same breakpoint, and each of its fragments is mapped to the same position on the reference sequence as the corresponding fragment of the read sequence. FIG. 3 shows an example of a mapping result.

As shown in FIG. 3, when the read sequence (r1) and the virtual complementary sequence (r1') are each split into two parts and the corresponding parts are mapped to the same regions, the candidate read sequence extraction unit 104 extracts that read sequence as a candidate read sequence. It further takes the two split points (b1, b2) of the candidate read sequence as breakpoint candidates.
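
The check described above, in which a read sequence and its virtual complementary sequence must both be split in two and agree on the mapped regions, might be sketched as follows. The `Segment` record, its `junction` field, and the coordinate tolerance of 5 bases are hypothetical choices made for this sketch; the embodiment does not prescribe a data format for the mapping result.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    """One mapped fragment of a read (hypothetical record layout)."""
    chrom: str     # reference chromosome
    start: int     # reference start of the mapped fragment
    end: int       # reference end of the mapped fragment
    junction: int  # reference coordinate of the fragment end at the split point


def same_region(a: Segment, b: Segment, tolerance: int) -> bool:
    """True when two fragments map to the same reference region."""
    return (a.chrom == b.chrom
            and abs(a.start - b.start) <= tolerance
            and abs(a.end - b.end) <= tolerance)


def breakpoint_candidates(
    read_segs: Sequence[Segment],
    comp_segs: Sequence[Segment],
    tolerance: int = 5,  # assumed slack, not a value from the text
) -> Optional[Tuple[Tuple[str, int], Tuple[str, int]]]:
    """Return the breakpoint candidates (b1, b2), or None if not a candidate."""
    if len(read_segs) != 2 or len(comp_segs) != 2:
        return None
    first, second = read_segs
    # The fragments of the virtual complement may be reported in either order.
    for c1, c2 in (tuple(comp_segs), tuple(reversed(comp_segs))):
        if same_region(first, c1, tolerance) and same_region(second, c2, tolerance):
            return ((first.chrom, first.junction), (second.chrom, second.junction))
    return None
```

A read that maps as a single fragment, or whose virtual complement maps to different regions, yields `None` and is not extracted.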

When extracting candidate read sequences, the candidate read sequence extraction unit 104 may further restrict extraction to those that satisfy the following conditions (A) to (D).

(A) The candidate read sequence extraction unit 104 may treat a read sequence as a candidate read sequence only when the split read sequence and the split virtual complementary sequence are each mapped to exactly two locations (four locations in total for the read sequence and the virtual complementary sequence) and the two fragments together make up the complete read sequence or virtual complementary sequence.

(B) The candidate read sequence extraction unit 104 may extract, as candidate read sequences, those for which the corresponding fragments of the read sequence and the virtual complementary sequence, each mapped to two locations, are mapped on the same chromosome.

(C) The candidate read sequence extraction unit 104 may extract, as candidate read sequences, those for which each fragment of the read sequence and the virtual complementary sequence, each mapped to two locations, has a length equal to or greater than a predetermined number of bases (for example, 10 bases or more). This makes it possible to exclude reads that are split because of, for example, a single-base mutation.

(D) When the fusion gene results from a chromosomal inversion, in which the position of a base sequence is rearranged within the same chromosome, the candidate read sequence extraction unit 104 may extract, as candidate read sequences, those whose two breakpoint candidates are 1,000,000 bases or more apart.
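
Conditions (A) to (D) might be combined into a single filter as in the following sketch. The `Fragment` record and its fields are hypothetical, and treating a same-chromosome pair mapped to opposite strands as a chromosomal inversion is an assumption made here for illustration; the text does not state how an inversion is recognized. The thresholds of 10 bases and 1,000,000 bases are the example values given above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Fragment:
    """One mapped fragment of a read or virtual complement (hypothetical)."""
    chrom: str        # reference chromosome
    strand: str       # "+" or "-"
    query_start: int  # offset within the read, 0-based inclusive
    query_end: int    # offset within the read, exclusive
    junction: int     # reference coordinate at the split point


MIN_FRAGMENT_LEN = 10               # condition (C), example value from the text
MIN_INVERSION_DISTANCE = 1_000_000  # condition (D)


def covers_whole_read(frags: Sequence[Fragment], read_len: int) -> bool:
    """True when two fragments tile the read with no gap or overlap."""
    a, b = sorted(frags, key=lambda f: f.query_start)
    return a.query_start == 0 and a.query_end == b.query_start and b.query_end == read_len


def passes_filters(read_frags, comp_frags, read_len: int) -> bool:
    # (A) exactly two fragments each, together forming the complete sequence
    if len(read_frags) != 2 or len(comp_frags) != 2:
        return False
    if not (covers_whole_read(read_frags, read_len)
            and covers_whole_read(comp_frags, read_len)):
        return False
    # (B) corresponding fragments lie on the same chromosome; the complement
    # is reversed, so its fragments correspond in the opposite query order
    r = sorted(read_frags, key=lambda f: f.query_start)
    c = sorted(comp_frags, key=lambda f: f.query_start, reverse=True)
    if any(rf.chrom != cf.chrom for rf, cf in zip(r, c)):
        return False
    # (C) every fragment has at least the minimum length
    if any(f.query_end - f.query_start < MIN_FRAGMENT_LEN
           for f in list(read_frags) + list(comp_frags)):
        return False
    # (D) assumed inversion test: same chromosome, opposite strands
    first, second = r
    is_inversion = first.chrom == second.chrom and first.strand != second.strand
    if is_inversion and abs(first.junction - second.junction) < MIN_INVERSION_DISTANCE:
        return False
    return True
```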

The group creation unit 105 collects candidate read sequences whose breakpoint candidates lie within a predetermined number of bases of one another into one group.
In FIG. 3, the candidate read sequences r2 to r4 have breakpoint candidates at almost the same positions as the candidate read sequence r1. In such a case, the group creation unit 105 treats the candidate read sequences r1 to r4 as having the same breakpoint candidates and collects them into one group. Specifically, for example, candidate read sequences may be placed in the same group if their split points coincide within a margin of 40 bases.
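
One possible way to form such groups is sketched below. The sketch compares each candidate with the first member of an existing group and requires both split points to agree within the 40-base margin given as an example above; whether one or both split points must agree, and which member serves as the point of comparison, are assumptions of this sketch. The tuple layout and function names are hypothetical.

```python
from typing import List, Tuple

Breakpoint = Tuple[str, int]                    # (chromosome, position)
Candidate = Tuple[str, Breakpoint, Breakpoint]  # (read id, b1, b2)


def near(a: Breakpoint, b: Breakpoint, tolerance: int) -> bool:
    """True when two breakpoints are on the same chromosome within tolerance."""
    return a[0] == b[0] and abs(a[1] - b[1]) <= tolerance


def group_candidates(candidates: List[Candidate],
                     tolerance: int = 40) -> List[List[Candidate]]:
    """Greedily group candidate reads whose breakpoint candidates are close."""
    groups: List[List[Candidate]] = []
    for cand in candidates:
        for group in groups:
            _, rep_b1, rep_b2 = group[0]  # compare with the group's first member
            if near(cand[1], rep_b1, tolerance) and near(cand[2], rep_b2, tolerance):
                group.append(cand)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([cand])
    return groups
```

Because the comparison is always made against the first member, a group cannot drift along the reference as more reads are added, at the cost of making the result depend on input order.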

The fusion gene determination information generation unit 106 generates, based on the characteristics and number of the candidate read sequences constituting each group, information (a ranking) for determining whether the candidate read sequences included in each group are derived from a fusion gene.

First, the fusion gene determination information generation unit 106 ranks each group according to how likely it is that the candidate read sequences included in the group are derived from a fusion gene. Specifically, it determines for each group whether each of the following four narrowing-down criteria applies.

(1) On-gene determination
The fusion gene determination information generation unit 106 determines, for the two fragments into which the read sequences of each group are split, whether each overlaps a gene region. When both fragments overlap gene regions, the criterion is judged to be satisfied (the group is likely to represent a fusion gene).

(2) Known target gene determination
For groups that satisfy criterion (1), the fusion gene determination information generation unit 106 determines whether the two fragments of the split read sequence correspond to genes known to form fusion genes. Specifically, when a receptor tyrosine kinase gene such as RET, ROS1, or ALK is included, the criterion is judged to be satisfied. These kinase genes are useful for fusion gene determination and also help in selecting therapeutic agents.

(3) In-frame determination
For groups that satisfy criterion (2), the fusion gene determination information generation unit 106 determines whether a frameshift has occurred in the exon regions of the read sequence fragments. When no frameshift has occurred, the criterion is judged to be satisfied. When a frameshift has occurred in an exon region, the protein is not synthesized, so the fusion is considered not very suitable as a target for cancer treatment.

(4) Coiled-coil structure determination
For groups that satisfy criterion (3), the fusion gene determination information generation unit 106 determines whether the gene on the upstream side of the read sequence fragments has a coiled-coil structure; when it does, the criterion is judged to be satisfied. For example, many of the gene fragments that fuse with receptor tyrosine kinase genes such as RET, ROS1, and ALK have a coiled-coil structure that brings about protein-protein interaction, and are known to activate the kinase independently of the ligand that conveys a growth signal from outside the cell.

The fusion gene determination information generation unit 106 assigns a higher rank to groups that satisfy more of criteria (1) to (4). The determination need not be made for all of these criteria; the ranking may use only some of them (for example, (1) and (2)).
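
Because each of criteria (2) to (4) is evaluated only for groups that satisfy the preceding criterion, the number of criteria satisfied can be counted by stopping at the first failure, as in the following sketch. The criterion keys and the dictionary-based input are hypothetical; how each individual criterion is evaluated (gene annotation lookup, frame check, coiled-coil prediction) is outside the scope of the sketch.

```python
from typing import Dict, Iterable, List, Tuple

# Criteria in the order (1) to (4); the key names are hypothetical.
CRITERIA = ("on_gene", "known_target", "in_frame", "coiled_coil")


def criteria_score(flags: Dict[str, bool],
                   criteria: Iterable[str] = CRITERIA) -> int:
    """Count the criteria satisfied in order, stopping at the first failure."""
    score = 0
    for name in criteria:
        if not flags.get(name, False):
            break
        score += 1
    return score


def rank_by_criteria(groups: Dict[str, Dict[str, bool]]) -> List[Tuple[str, int]]:
    """Return (group name, score) pairs, highest score first."""
    scored = [(name, criteria_score(flags)) for name, flags in groups.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Passing a shorter tuple such as `("on_gene", "known_target")` as `criteria` corresponds to ranking with only some of the criteria.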

Next, the fusion gene determination information generation unit 106 sets a high rank for groups made up of many candidate read sequences; for example, the more candidate read sequences a group has, the higher its rank. In addition, when a certain number or more of different partner split points exist for one split point of the candidate read sequences constituting a group, nonspecific mapping is suspected, so the rank may be set low. For example, when a group contains a candidate read sequence with the pair of split points (b1, b2) shown in FIG. 3 and a candidate read sequence with the pair of split points (b3, b4), and b1 and b3 are close together but b2 and b4 are far apart, the rank of that group may be set low.

The fusion gene determination information generation unit 106 ranks all groups based on both the ranking by the narrowing-down criteria and the ranking by the number of candidate read sequences constituting each group. For example, each of the two rankings may be converted into points, with groups having higher total points ranked higher. The fusion gene determination information generation unit 106 displays, on the display device 107, a list of the candidate read sequences arranged in descending order of rank.
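
The point-based combination might look like the following sketch. The text does not specify how the two rankings are converted into points, so the equal weighting of criteria points and read count, the threshold of 3 partner split points, and the penalty of 10 points are all assumptions chosen only to make the example concrete; the `GroupSummary` record is likewise hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GroupSummary:
    """Per-group figures used for ranking (hypothetical record layout)."""
    name: str
    criteria_points: int  # number of criteria (1)-(4) satisfied, 0 to 4
    read_count: int       # candidate read sequences in the group
    partner_sites: int    # distinct partner split points seen for one split point


def total_points(group: GroupSummary,
                 partner_limit: int = 3,  # assumed threshold
                 penalty: int = 10) -> int:  # assumed penalty
    """Combine both rankings into one score; the weighting is an assumption."""
    points = group.criteria_points + group.read_count
    # Many different partners for one split point suggests nonspecific mapping.
    if group.partner_sites >= partner_limit:
        points -= penalty
    return points


def rank_groups(groups: List[GroupSummary],
                partner_limit: int = 3,
                penalty: int = 10) -> List[GroupSummary]:
    """Return the groups ordered from highest to lowest total points."""
    return sorted(groups,
                  key=lambda g: total_points(g, partner_limit, penalty),
                  reverse=True)
```

In practice the relative weight of the two components would need tuning, since a raw read count can easily outweigh the four criteria points.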

Next, the operation of the fusion gene analysis system 10 will be described.
FIG. 4 is a flowchart of the operation of the fusion gene analysis system 10.
First, the read sequence acquisition unit 101 acquires read sequences from the DNA sequencer 200 (step S1).
Next, the virtual complementary sequence generation unit 102 creates a complementary sequence for every acquired read sequence and outputs it as a virtual complementary sequence (step S2).

Next, the mapping information acquisition unit 103 inputs the read sequences and the virtual complementary sequences to the gene mapping apparatus 300 (step S3).
Next, the gene mapping apparatus 300 maps the input read sequences and virtual complementary sequences (step S4).
Next, the mapping information acquisition unit 103 acquires the result of the mapping by the gene mapping apparatus 300 (step S5).

Next, the candidate read sequence extraction unit 104 extracts candidate read sequences from the result of the mapping by the gene mapping apparatus 300 (step S6).
The candidate read sequence extraction unit 104 then sets the breakpoint candidates of the extracted candidate read sequences (step S7).

Next, the group creation unit 105 collects candidate read sequences whose breakpoint candidates lie within a predetermined number of bases of one another into one group (step S8).
Next, the fusion gene determination information generation unit 106 uses the narrowing-down criteria for fusion genes to rank how likely it is that the candidate read sequences included in each group are derived from a fusion gene (step S9).

Next, the fusion gene determination information generation unit 106 ranks each group based on the number of candidate read sequences constituting the group (step S10).
The fusion gene determination information generation unit 106 then ranks all groups based on both the ranking by the narrowing-down criteria and the ranking by the number of candidate read sequences constituting each group, and displays the result on the display device 107 (step S11).

以上のように、本実施形態によれば、シーケンサから出力されるすべてのリード配列に対して仮想相補配列を作成し、リード配列と仮想相補配列のマッピング結果に基づいて候補リード配列を抽出するようにしたので、マッピングをリード配列と仮想相補配列の２重で行うためマッピングの精度が向上し、融合遺伝子解析の精度も向上させることができる。 As described above, according to the present embodiment, virtual complementary sequences are created for all the read sequences output from the sequencer, and candidate read sequences are extracted based on the mapping result between the read sequence and the virtual complementary sequence. Therefore, since the mapping is performed with the double of the lead sequence and the virtual complementary sequence, the accuracy of mapping can be improved and the accuracy of fusion gene analysis can also be improved.

In addition, candidate read sequences whose breakpoint candidates are close to one another are collected into one group, and the candidate read sequences are further narrowed down based on the characteristics and number of the candidate read sequences constituting each group. This makes it possible to limit, with high accuracy, the number of candidate read sequences that ultimately need to be analyzed, which improves the efficiency of fusion gene analysis and shortens the time it takes.

(Example)
FIG. 5 is a diagram showing the results of analysis by the fusion gene analysis system 10.
FIG. 5 shows the results of analyzing samples of three cell lines, AD09-232T (ALK-EML4 fusion gene positive), HCC78 (ROS1-SLC34A2 fusion gene positive), and LC2/ad (CCDC6-RET fusion gene positive), using MiSeq (manufactured by Illumina) and Ion Torrent (manufactured by Thermo Fisher Scientific) as the DNA sequencer 200. MiSeq is a paired-end read sequencer, and Ion Torrent is a single-end read sequencer.

The “total number of reads” in the input data indicates the number of read sequences output from the DNA sequencer 200. The “read sequences / virtual complementary sequences” value is the combined number of read sequences and virtual complementary sequences created by the virtual complementary sequence generation unit 102, and equals twice the total number of reads. The “mapping result” indicates the total number of reads after mapping by the gene mapping device 300 (the sum of read sequences and virtual complementary sequences). Here, mapping is performed using BWA-SW.

The “classification by number of mapped locations” shows the result of classifying each read by the total number of locations at which the read sequence and its corresponding virtual complementary sequence are mapped. As described above, reads for which the read sequence and the corresponding virtual complementary sequence are each mapped to two locations, that is, “four locations” in total, are subject to extraction as candidate read sequences. The candidate read sequence extraction unit 104 then narrows these down under the predetermined conditions described above, and the number of candidate read sequences finally extracted is shown as the “number of candidate read sequences”.
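The classification by number of mapped locations can be sketched in Python as follows. The input representation (two dictionaries giving, per read ID, how many reference locations the read and its virtual complementary sequence were mapped to) and the name `classify_by_map_count` are assumptions made for illustration. The further narrowing under the predetermined conditions is not modelled.

```python
from collections import Counter


def classify_by_map_count(read_hits, vc_hits):
    """Tally reads by total mapped locations and pick 2 + 2 candidates.

    read_hits: dict read_id -> number of locations the read mapped to
    vc_hits:   dict read_id -> number of locations its virtual
               complementary sequence mapped to
    Returns (counts, candidates): counts maps a total number of locations
    to a number of reads; candidates lists the reads whose read sequence
    and virtual complementary sequence each map to exactly two locations.
    """
    counts = Counter()
    candidates = []
    for read_id, n_read in read_hits.items():
        n_vc = vc_hits.get(read_id, 0)
        counts[n_read + n_vc] += 1
        # A total of four is not enough: it must be two plus two.
        if n_read == 2 and n_vc == 2:
            candidates.append(read_id)
    return counts, candidates
```

Note that a read mapped to three locations whose virtual complementary sequence maps to one also totals four but is not a candidate under this reading, which is why the check is made per sequence rather than on the sum.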

Furthermore, the number of groups formed by the group creation unit 105 is shown as the “number of groups”. The number of candidate read sequences determined by the fusion gene determination information generation unit 106 to satisfy the On gene determination condition is shown as the “number of On gene candidates”. Among the On gene candidates, the number of candidate read sequences determined by the fusion gene determination information generation unit 106 to satisfy the known target gene determination condition is shown as the “number of RET/ROS1/ALK candidates”. As the “number of RET/ROS1/ALK candidates” shows, the read sequences that are fusion gene candidates are narrowed down to 639, 924, and 271 for the respective samples.

As described above, for each sample, the number of candidate fusion gene reads can be greatly reduced from the total number of reads output from the sequencer.

DESCRIPTION OF SYMBOLS
10 fusion gene analysis system, 50 communication line, 100 fusion gene analysis apparatus, 101 read sequence acquisition unit, 102 virtual complementary sequence generation unit, 103 mapping information acquisition unit, 104 candidate read sequence extraction unit, 105 group creation unit, 106 fusion gene determination information generation unit, 107 display device, 108 input device, 200 DNA sequencer, 300 gene mapping device

Claims

A fusion gene analysis apparatus comprising:
a read sequence acquisition unit that acquires read sequences output from a sequencer;
a virtual complementary sequence generation unit that creates a complementary sequence for every acquired read sequence and outputs it as a virtual complementary sequence;
a mapping information acquisition unit that supplies the read sequences and the virtual complementary sequences to a mapping device and acquires a result of mapping onto a reference sequence by the mapping device;
a candidate read sequence extraction unit that extracts, as candidate read sequences, those reads for which, in the mapping result, the read sequence and the corresponding virtual complementary sequence are each split and mapped to two locations on the reference sequence, and that sets the division point of each extracted candidate read sequence as a breakpoint candidate;
a group creation unit that collects candidate read sequences whose breakpoint candidates lie within a predetermined number of bases of one another into one group; and
a fusion gene determination information generation unit that generates information for determining whether the candidate read sequences included in each group are derived from a fusion gene, based on the characteristics and number of the candidate read sequences constituting the group.

The fusion gene analysis apparatus according to claim 1, wherein the candidate read sequence extraction unit extracts, as a candidate read sequence, a read for which each fragment of the read sequence mapped to the two locations and the corresponding fragment of the virtual complementary sequence are mapped on the same chromosome.

The fusion gene analysis apparatus according to claim 1, wherein the candidate read sequence extraction unit extracts, as a candidate read sequence, a read for which each fragment of the read sequence and of the virtual complementary sequence mapped to the two locations has a length equal to or greater than a predetermined number of bases.

The fusion gene analysis apparatus according to claim 1, wherein the fusion gene determination information generation unit ranks each group according to the likelihood that the candidate read sequences included in the group are derived from a fusion gene.

The fusion gene analysis apparatus according to claim 4, wherein the fusion gene determination information generation unit assigns a higher rank to a group constituted by a larger number of candidate read sequences.

The fusion gene analysis apparatus according to claim 4, wherein the fusion gene determination information generation unit assigns a lower rank to a group when, for one division point of the candidate read sequences constituting the group, there are a predetermined number or more of other division points.

A fusion gene analysis method comprising:
a step of acquiring read sequences output from a sequencer;
a step of creating a complementary sequence for every acquired read sequence and outputting it as a virtual complementary sequence;
a step of supplying the read sequences and the virtual complementary sequences to a mapping device and acquiring a result of mapping onto a reference sequence by the mapping device;
a step of extracting, as candidate read sequences, those reads for which, in the mapping result, the read sequence and the corresponding virtual complementary sequence are each split and mapped to two locations on the reference sequence, and setting the division point of each extracted candidate read sequence as a breakpoint candidate;
a step of collecting candidate read sequences whose breakpoint candidates lie within a predetermined number of bases of one another into one group; and
a step of generating information for determining whether the candidate read sequences included in each group are derived from a fusion gene, based on the characteristics and number of the candidate read sequences constituting the group.

A program that causes a computer to function as:
a read sequence acquisition unit that acquires read sequences output from a sequencer;
a virtual complementary sequence generation unit that creates a complementary sequence for every acquired read sequence and outputs it as a virtual complementary sequence;
a mapping information acquisition unit that supplies the read sequences and the virtual complementary sequences to a mapping device and acquires a result of mapping onto a reference sequence by the mapping device;
a candidate read sequence extraction unit that extracts, as candidate read sequences, those reads for which, in the mapping result, the read sequence and the corresponding virtual complementary sequence are each split and mapped to two locations on the reference sequence, and that sets the division point of each extracted candidate read sequence as a breakpoint candidate;
a group creation unit that collects candidate read sequences whose breakpoint candidates lie within a predetermined number of bases of one another into one group; and
a fusion gene determination information generation unit that generates information for determining whether the candidate read sequences included in each group are derived from a fusion gene, based on the characteristics and number of the candidate read sequences constituting the group.