JP6675164B2 - Mutation judgment method, mutation judgment program and recording medium

Info

Publication number: JP6675164B2
Application number: JP2015149037A
Authority: JP
Inventors: 慶斎藤; 尊規鷲尾
Original assignee: Riken Genesis Co Ltd
Current assignee: Riken Genesis Co Ltd
Priority date: 2015-07-28
Filing date: 2015-07-28
Publication date: 2020-04-01
Anticipated expiration: 2035-07-28
Also published as: JP2017033046A

Description

The present invention relates to a mutation detection method performed by a computer to determine the presence or absence of a mutation in a target gene, a mutation detection program executed by a computer for the same purpose, and a recording medium on which the mutation detection program is recorded.

In recent years, techniques that use a sequencer to determine the presence or absence of a gene mutation have been under development, and they are recognized as highly useful tools for diagnosing disease. Specifically, a disease can be diagnosed by sequencing a region that contains a known mutation in a specific disease-related gene and determining the presence or absence of the mutation from the resulting sequence data.

One example of such a technique is amplicon sequencing using a next-generation sequencer. In amplicon sequencing, an exon region of the target gene is first amplified by PCR, and the resulting PCR product is sequenced by a next-generation sequencer. The reads obtained by sequencing are then mapped to a predetermined reference sequence, which allows the presence or absence of a mutation to be evaluated and determined.

Such a technique has the advantage that the presence or absence of mutations can be determined inexpensively for a large number of samples, but it has the problem that erroneous determinations can arise from sequencing errors and the like. The problem is especially pronounced when a mutation is to be detected in a high-background specimen, such as a specimen that contains only a trace amount of the target DNA or a specimen that contains a large amount of contamination other than the target DNA, as is the case when detecting mutations in ctDNA (circulating tumor DNA) derived from cancer cells in the bloodstream of a cancer patient.

In response, Kukita et al. have reported establishing a method for detecting mutations in the epidermal growth factor receptor (EGFR) gene, a cancer-related gene, in ctDNA (Non-Patent Document 1). In the method of Non-Patent Document 1, a threshold is calculated for each mutation site based on a statistical model estimated from sequence data of normal specimens that have no mutation, and a mutation is detected by comparing the sequencing result of the specimen under test with that threshold.

Kukita et al., PLoS One. 8(11) e81468, 2013.

In the technique described in Non-Patent Document 1, a threshold is calculated in advance from sequence data of normal specimens; the specimen under test is then sequenced, and mutations are detected based on the result. In detail, a statistical model is estimated from the sequence data of normal specimens as the probability distribution of the number of mutant reads detected in 100,000 reads, and the threshold is calculated from that statistical model. The specimen is then subjected to deep sequencing of 100,000 reads or more, and the result is converted into the number of mutant reads detected per 100,000 reads before being compared with the threshold.

In this situation, the number of reads obtained from a specimen may differ from specimen to specimen and from region to region. For example, when multiple regions of a specimen are amplified simultaneously by PCR and then sequenced, the amplification efficiency may not be uniform across the regions, and as a result the number of reads obtained differs for each specimen and each region.

In the technique described in Non-Patent Document 1, the statistical model is estimated in advance as the probability distribution of the number of mutant reads detected in a fixed number of reads, and the sequencing result of the specimen is converted into the number of mutant reads detected in that fixed number of reads before the determination is made. Consequently, when the number of reads differs for each specimen and each region, the determination criterion varies for each specimen and each mutation, and an accurate determination may not be possible.

In view of the above, a main object of the present invention is to provide a technique for determining the presence or absence of a mutation with high accuracy even when the number of reads obtained from a specimen differs for each specimen and each region.

To solve the above problem, a mutation determination method according to one aspect of the present invention is a mutation determination method performed by a computer to determine the presence or absence of a mutation in a target gene, and includes: an extraction step of extracting, from a plurality of reads obtained by sequencing a polynucleotide derived from a specimen, corresponding reads that correspond to a specific region of the target gene and, among the corresponding reads, mutant reads that have a specific mutation within the specific region; a calculation step of calculating, from the number of extracted corresponding reads, a threshold for determining the presence or absence of the mutation at a predetermined significance level; and a determination step of determining that the specific mutation is present when the number of extracted mutant reads exceeds the calculated threshold.

According to the above configuration, a threshold is calculated from the number of extracted corresponding reads each time corresponding reads are extracted, so the threshold for determining the presence or absence of a mutation at the predetermined significance level is calculated more appropriately, using the actual number of corresponding reads. As a result, the presence or absence of a mutation can be determined with high accuracy even when the number of reads obtained from a specimen differs for each specimen and each region.
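The calculation and determination steps above can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes a Poisson background model whose mean is the number of corresponding reads multiplied by a per-read background rate, and the names `poisson_pmf`, `threshold_from_depth`, `has_mutation`, `error_rate`, and `alpha` are hypothetical.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of exactly k events, computed in log space."""
    if lam == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def threshold_from_depth(depth: int, error_rate: float, alpha: float) -> int:
    """Smallest count t with P(X > t) <= alpha under background noise only.

    Assumption: X ~ Poisson(depth * error_rate), where depth is the number
    of corresponding reads and error_rate is a hypothetical per-read
    background rate for this mutation.
    """
    lam = depth * error_rate
    t = 0
    cdf = poisson_pmf(0, lam)
    while 1.0 - cdf > alpha:
        t += 1
        cdf += poisson_pmf(t, lam)
    return t


def has_mutation(mutant_reads: int, depth: int,
                 error_rate: float, alpha: float = 0.01) -> bool:
    """Determination step: mutation is called when the count exceeds the threshold."""
    return mutant_reads > threshold_from_depth(depth, error_rate, alpha)
```

Because the threshold is recomputed from the observed number of corresponding reads, a region sequenced at a greater depth receives a larger threshold than the same region sequenced at a smaller depth, rather than both being rescaled to a fixed read count.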

In the mutation determination method according to one aspect of the present invention, the computer may store statistical model information indicating a statistical model associated with the specific mutation, and in the calculation step, a threshold for determining the presence or absence of a significant difference at the predetermined significance level may be calculated from the number of corresponding reads with reference to the statistical model information.

According to the above configuration, by referring to the statistical model information stored in the computer, the threshold for determining the presence or absence of a mutation at the predetermined significance level can be suitably calculated using the statistical model associated with the specific mutation.

In the mutation determination method according to one aspect of the present invention, the statistical model information may include the type of the statistical model associated with the specific mutation.

According to the above configuration, because the statistical model information includes the type of the statistical model (for example, Poisson distribution or negative binomial distribution), the threshold can be calculated more suitably by using a probability distribution model appropriate to each mutation.
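As a rough sketch of how a stored model type might drive the threshold calculation, the following selects between a Poisson and a negative binomial model per mutation. The `ModelInfo` record, its field names, and the mean-and-dispersion parameterization of the negative binomial distribution are assumptions made for illustration; the source only states that the model type is stored with the mutation.

```python
import math
from dataclasses import dataclass


@dataclass
class ModelInfo:
    """Hypothetical per-mutation statistical model information."""
    kind: str                 # "poisson" or "negative_binomial"
    error_rate: float         # assumed background rate per read (must be > 0)
    dispersion: float = 1.0   # negative binomial size parameter r (unused for Poisson)


def log_pmf(k: int, depth: int, info: ModelInfo) -> float:
    """Log probability of k background mutant reads at the given depth."""
    mu = depth * info.error_rate
    if info.kind == "poisson":
        return k * math.log(mu) - mu - math.lgamma(k + 1)
    if info.kind == "negative_binomial":
        r = info.dispersion
        return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
                + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))
    raise ValueError(f"unknown model type: {info.kind}")


def threshold(depth: int, info: ModelInfo, alpha: float = 0.01) -> int:
    """Smallest count t with P(X > t) <= alpha under the selected model."""
    t = 0
    cdf = math.exp(log_pmf(0, depth, info))
    while 1.0 - cdf > alpha:
        t += 1
        cdf += math.exp(log_pmf(t, depth, info))
    return t
```

With the same mean, the negative binomial model has a heavier tail than the Poisson model, so it yields a threshold at least as large; this is one reason a per-mutation choice of model could matter.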

In the mutation determination method according to one aspect of the present invention, the statistical model information may further include normal specimen data relating to the specific mutation.

According to the above configuration, because the statistical model information includes normal specimen data, the threshold for determining the presence or absence of a mutation at the predetermined significance level can be suitably calculated.

In the mutation determination method according to one aspect of the present invention, the computer may store, in association with the specific mutation, minimum value information indicating whether a minimum threshold value is to be used, and in the calculation step, when the minimum value information indicates that the minimum threshold value is to be used, the minimum threshold value calculated from the number of corresponding reads with reference to a predetermined statistical model may be calculated as the threshold.

According to the above configuration, the threshold can be prevented from becoming excessively small. This improves the reliability of the determination result.
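One possible reading of the minimum value handling is a per-mutation flag that puts a floor under the model-derived threshold. The sketch below follows that reading only; the function name, the flag table, and the mutation identifiers are hypothetical, and the floor is passed in rather than derived, since the source does not specify the predetermined statistical model.

```python
# Hypothetical minimum value information: mutation identifier -> use a floor?
minimum_value_info = {"mutation_A": True, "mutation_B": False}


def final_threshold(model_threshold: int, minimum_threshold: int,
                    use_minimum: bool) -> int:
    """Apply a floor to the threshold when the mutation's flag requests it.

    model_threshold: threshold from the mutation's own statistical model.
    minimum_threshold: floor assumed to come from a predetermined model
        evaluated at the same number of corresponding reads.
    """
    if use_minimum:
        return max(model_threshold, minimum_threshold)
    return model_threshold
```

Under this reading, a mutation whose estimated background is very low cannot end up with a threshold of only one or two reads when the flag is set.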

In the mutation determination method according to one aspect of the present invention, the computer preferably stores a plurality of reference sequences, and the extraction step preferably includes mapping the plurality of reads to each of the plurality of reference sequences.

According to the above configuration, by mapping the plurality of reads to each of the plurality of reference sequences, the corresponding reads that correspond to the specific region of the target gene, and the mutant reads among them that have the specific mutation, can be suitably extracted from the plurality of reads.

In the mutation determination method according to one aspect of the present invention, when the specific mutation is a single nucleotide substitution mutation, the reference sequences may include a specific reference sequence that is the sequence of the specific region, and in the extraction step, the reads mapped to the specific reference sequence may be extracted as the corresponding reads, each extracted corresponding read may be checked for the specific mutation, and the corresponding reads that have the specific mutation may be extracted as the mutant reads.

According to the above configuration, the corresponding reads and the mutant reads can be suitably extracted when the specific mutation is a single nucleotide substitution mutation.
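The extraction described above for a single nucleotide substitution can be sketched as follows. The `MappedRead` record, gapless alignment with a 0-based start offset, and the function name are simplifying assumptions for illustration; a real pipeline would read alignments from a mapper's output.

```python
from typing import List, NamedTuple, Tuple


class MappedRead(NamedTuple):
    """Hypothetical gapless alignment of one read to one reference sequence."""
    ref_name: str   # name of the reference sequence the read mapped to
    start: int      # 0-based position of the read's first base on the reference
    seq: str        # read bases


def extract_snv_reads(reads: List[MappedRead], ref_name: str, pos: int,
                      alt_base: str) -> Tuple[List[MappedRead], List[MappedRead]]:
    """Return (corresponding reads, mutant reads) for one substitution.

    Corresponding reads: all reads mapped to the specific reference sequence.
    Mutant reads: corresponding reads that cover pos and carry alt_base there.
    """
    corresponding = [r for r in reads if r.ref_name == ref_name]
    mutant = [
        r for r in corresponding
        if r.start <= pos < r.start + len(r.seq) and r.seq[pos - r.start] == alt_base
    ]
    return corresponding, mutant
```

The length of the first list is the read count that feeds the threshold calculation, and the length of the second is the count compared against that threshold.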

In the mutation determination method according to one aspect of the present invention, when the specific mutation is at least one of an insertion and a deletion, the reference sequences may include a specific reference sequence that is the sequence of the specific region and a detection reference sequence that is the specific reference sequence with at least one of an insertion and a deletion introduced, and in the extraction step, the reads mapped to the specific reference sequence and the reads mapped to the detection reference sequence may be extracted as the corresponding reads, and the reads mapped to the detection reference sequence may be extracted as the mutant reads.

According to the above configuration, the corresponding reads and the mutant reads can be suitably extracted when the specific mutation consists of at least one of an insertion and a deletion.

In the mutation determination method according to one aspect of the present invention, the specific mutation may be a mutation consisting of at least one of an insertion of predetermined bases and a deletion of predetermined bases, the detection reference sequences may include a plurality of detection reference sequences corresponding to mutually different mutations, and in the extraction step, the reads mapped to the detection reference sequence that corresponds to the specific mutation, among the detection reference sequences, may be extracted as the mutant reads.

According to the above configuration, the corresponding reads and the mutant reads can be suitably extracted when the specific mutation consists of at least one of an insertion of predetermined bases and a deletion of predetermined bases, in other words, when the specific mutation is a specific type of insertion and/or deletion mutation. Compared with the prior art, which could only determine whether an insertion and/or deletion mutation is present, insertion and/or deletion mutations can be detected type by type, so more detailed mutation information is obtained. In addition, because the threshold for determining the presence or absence of an insertion and/or deletion mutation is calculated for each type, the determination can be made with higher accuracy.
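A minimal sketch of the type-by-type counting described above is shown below. It assumes each read has already been assigned to exactly one reference sequence by a mapper, and the function and reference names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def count_indel_reads(mapped_ref_names: Iterable[str], wild_type_ref: str,
                      detection_refs: List[str]) -> Tuple[int, Dict[str, int]]:
    """Count corresponding reads and per-type mutant reads for one region.

    mapped_ref_names: for each read, the name of the reference it mapped to.
    Corresponding reads: reads on the wild-type (specific) reference plus
        reads on any detection reference of the region.
    Mutant reads for a type: reads on that type's detection reference.
    """
    counts = Counter(mapped_ref_names)
    depth = counts[wild_type_ref] + sum(counts[d] for d in detection_refs)
    per_type = {d: counts[d] for d in detection_refs}
    return depth, per_type
```

Each entry of the per-type dictionary can then be compared against a threshold calculated for that type from the shared number of corresponding reads.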

In the mutation determination method according to one aspect of the present invention, in the extraction step, when the proportion of mismatch reads, that is, reads whose sequence does not match the detection reference sequence, among the mutant reads mapped to a single detection reference sequence exceeds a predetermined proportion, the mismatch reads may be excluded from the corresponding reads and the mutant reads.

According to the above configuration, when the specific mutation is at least one of an insertion and a deletion, mismatch reads whose sequence does not match the single detection reference sequence are removed as low-reliability reads, which improves the reliability of the determination.

In particular, the determination can be made efficiently by excluding mismatch reads only when their share of the total exceeds the predetermined proportion and their influence on the determination result is therefore large.
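The conditional exclusion described above might look like the following sketch. The per-read mismatch counts, the function name, and the return convention are assumptions for illustration.

```python
from typing import Sequence, Tuple


def exclude_mismatch_reads(mismatch_counts: Sequence[int],
                           max_ratio: float) -> Tuple[int, int]:
    """Decide how many reads on one detection reference to keep.

    mismatch_counts: one entry per read mapped to a single detection
        reference, giving its number of mismatches against that reference.
    Returns (reads kept, reads excluded). Mismatch reads are excluded only
    when their share of the mapped reads exceeds max_ratio.
    """
    total = len(mismatch_counts)
    mismatched = sum(1 for m in mismatch_counts if m > 0)
    if total > 0 and mismatched / total > max_ratio:
        return total - mismatched, mismatched
    return total, 0
```

The excluded count would be subtracted from both the mutant read count and the corresponding read count before the threshold is calculated.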

In the mutation determination method according to one aspect of the present invention, in the extraction step, reads mapped to more than one of the reference sequences may be excluded from the corresponding reads and the mutant reads.

According to the above configuration, reads mapped to more than one reference sequence are removed as low-reliability reads, which improves the reliability of the determination.

The mutation determination method according to one aspect of the present invention may further include a warning step of outputting a warning when at least one read has been excluded from the corresponding reads in the extraction step.

According to the above configuration, a warning is output when low-reliability reads have been removed, so the user is made aware of the reliability of the determination.
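The exclusion of multiply mapped reads and the accompanying warning can be sketched together as below. The input layout (read identifier to list of mapped reference names), the function name, and the warning text are hypothetical; warnings are returned as strings so that the caller decides how to present them.

```python
from typing import Dict, List, Tuple


def drop_multi_mapped(hits: Dict[str, List[str]]) -> Tuple[Dict[str, str], List[str]]:
    """Keep reads mapped to exactly one reference and report any exclusions.

    hits: read identifier -> names of all references the read mapped to.
    Returns (read identifier -> its single reference, warning messages).
    Reads with no hit are treated as unmapped and are neither kept nor
    counted as excluded.
    """
    kept: Dict[str, str] = {}
    excluded = 0
    for read_id, refs in hits.items():
        if len(refs) == 1:
            kept[read_id] = refs[0]
        elif len(refs) > 1:
            excluded += 1
    warnings: List[str] = []
    if excluded > 0:
        warnings.append(
            f"{excluded} read(s) mapped to multiple reference sequences were excluded"
        )
    return kept, warnings
```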

In the mutation determination method according to one aspect of the present invention, the target gene may be the EGFR gene, and the plurality of reference sequences may include each of the nucleotide sequences shown in SEQ ID NOs: 1 to 101.

According to the above configuration, mutations in the EGFR gene can be suitably detected. In particular, the reference sequences include detection reference sequences (SEQ ID NOs: 5 to 13, 15 to 18, 20 to 33, and 35 to 101) corresponding to almost all known types of insertion and/or deletion mutations, as well as detection reference sequences that take read errors into account (SEQ ID NOs: 14, 19, and 34), so insertion and/or deletion mutations can be determined with high accuracy.

The mutation determination method according to one aspect of the present invention may further include, before the extraction step, an update step of updating the reference sequences based on information obtained from an external database.

According to the above configuration, reference sequences based on the latest knowledge obtained from an external database can be used, so the desired mutation can be suitably detected.

In the mutation determination method according to one aspect of the present invention, in the extraction step, reads shorter than a predetermined length may be removed before the corresponding reads are extracted.

According to the above configuration, removing short reads improves the reliability of the determination.
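As a small illustration, the length filter applied before extraction could be written as follows. The function name and the `min_length` parameter are hypothetical, and the source does not state a particular cutoff.

```python
from typing import List


def remove_short_reads(reads: List[str], min_length: int) -> List[str]:
    """Drop reads shorter than min_length before mapping and extraction."""
    return [read for read in reads if len(read) >= min_length]
```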

In the mutation determination method according to one aspect of the present invention, the polynucleotide may be a PCR product, and the sequencing may be sequencing that obtains the plurality of reads by reading the same region repeatedly.

According to the above configuration, the mutation can be suitably determined.

A mutation determination program according to one aspect of the present invention is a mutation determination program executed by a computer to determine the presence or absence of a mutation in a target gene, and causes the computer to execute: an extraction step of extracting, from a plurality of reads obtained by sequencing a polynucleotide derived from a specimen, corresponding reads that correspond to a specific region of the target gene and, among the corresponding reads, mutant reads that have a specific mutation within the specific region; a calculation step of calculating, from the number of extracted corresponding reads, a threshold for determining the presence or absence of the mutation at a predetermined significance level; and a determination step of determining that the specific mutation is present when the number of extracted mutant reads exceeds the calculated threshold.

According to the above configuration, effects equivalent to those of the mutation determination method according to one aspect of the present invention are obtained.

A computer-readable recording medium on which the mutation determination program according to one aspect of the present invention is recorded also falls within the scope of the present invention.

According to the present invention, the presence or absence of a mutation can be determined with high accuracy even when the number of reads obtained from a specimen differs for each specimen and each region.

The drawings are described below in order.
A schematic diagram of a computer and related components that execute the mutation determination method according to an embodiment of the present invention.
A diagram showing the flow of the mutation determination method according to an embodiment of the present invention.
A diagram showing the flow of the SNV detection process according to an embodiment of the present invention.
A diagram showing the flow of the InDel detection process according to an embodiment of the present invention.
A diagram showing the flow of the selection process for the statistical selection model according to an embodiment of the present invention.
A diagram showing the flow of the threshold determination process according to an embodiment of the present invention.
A diagram showing the flow of the threshold determination process for the Poisson distribution according to an embodiment of the present invention.
A diagram showing the flow of the threshold determination process for the negative binomial distribution according to an embodiment of the present invention.
A diagram schematically showing the number of template sequences and the state of mapping according to an example of the present invention.
A diagram showing an example, according to an example of the present invention, of how increasing the number of template sequences reduces mismapping, using tumor specimens from data published in the literature.

Embodiments of the present invention are described below. The present invention is not limited to these embodiments.

[Definition of terms, etc.]
As used herein, "polynucleotide" may also be called "nucleic acid" or "nucleic acid molecule" and means a polymer of nucleotides. "Base sequence" may also be called "nucleic acid sequence" or "nucleotide sequence" and, unless otherwise noted, means a sequence of deoxyribonucleotides or a sequence of ribonucleotides. The term covers both single-stranded and double-stranded polynucleotides.

As used herein, "specimen" may also be called "sample", is used in the art synonymously with specimen or preparation, and means any preparation obtained from a biological material serving as a source (for example, an individual, a body fluid, a cell line, a tissue culture, or a tissue section).

As used herein, "polynucleotide derived from a specimen" is also called a "polynucleotide sequence sample" and means a polynucleotide obtained by processing a specimen. Processing for obtaining a polynucleotide includes processes known in the art, such as nucleic acid extraction, nucleic acid amplification, and reverse transcription.

As used herein, "read", "read sequence", or "sequence read" means a polynucleotide sequence obtained by sequencing and refers to a sequence output from a sequencer. As used herein, a read corresponding to a specific region of a gene is a read whose sequence is identical to the specific region of the gene or contains mutations within an allowable range. As used herein, a read corresponding to a specific mutation of a gene is a read that corresponds to a region containing the specific mutation of the gene and whose sequence contains the specific mutation.

As used herein, "InDel (Insertion and/or Deletion)" means a mutation that contains an insertion, a deletion, or both an insertion and a deletion. It may also be written as "insertion and/or deletion mutation".

As used herein, "SNV (Single Nucleotide Variant)" means a single nucleotide substitution mutation.

As used herein, "depth" means the total number of reads obtained by sequencing the same sequence or region.

As used herein, "reference" or "reference sequence" is a sequence to which reads are mapped in order to determine which region of a gene a read corresponds to and/or which mutation of a gene a read corresponds to. When the mutation is an SNV, the reference sequence may be a wild-type partial sequence of the gene. When the mutation is an InDel, the reference sequences may include a wild-type partial sequence of the gene and a sequence in which a known or unknown InDel has occurred in that wild-type partial sequence. In this specification, a sequence in which a known or unknown InDel has occurred in a wild-type partial sequence of a gene is also called a "template" or "template sequence".

As used herein, "RER (read error rate)" means the read error rate, that is, the ratio of the number of reads in which a base read error has occurred to the total number of sequence reads obtained by sequencing. Here, a read error is an error in which a base is read as a base different from that of the original sequence; when the sample is a PCR product, the term covers both errors introduced during sequencing and errors introduced during PCR.

As used herein, "alignment" means the state of a sequence identified as a match in the order of one or more nucleotides in the polynucleotide sequence of a sample relative to a reference sequence, which is a known sequence. Such alignment can be performed by known computer algorithms.

As used herein, "statistical model" is also called "distribution model" or "probability distribution model" and, as used in this specification, refers to a statistical model that represents the probability distribution of the read error rate.

As used herein, "parameter" means a variable that characterizes a quantitative data set and/or a numerical relationship between quantitative data sets, and is described, for example, as a variable applied to a distribution model.

As used herein, the term "threshold" means any numerical value, expressed as a number of reads, that is calculated using the distribution model described in the following embodiments and that acts as a cutoff value in determining a mutation in the polynucleotide sequence of a sample. The method of setting the threshold of the present invention is described in detail below in (Step 6. Setting of threshold) and in the Examples and elsewhere.

本明細書において「マッピング」は、対象となるリファレンス配列に各リードを整列させる処理を意味している。 As used herein, “mapping” means a process for aligning each read with a target reference sequence.

In this specification, "positive" means that a read sequence has the true mutation to be determined.

In this specification, "false positive" means that a read sequence is judged to have a mutation even though it does not have the true mutation to be determined.

In this specification, "negative" means that a read sequence does not have the mutation of interest.

In this specification, "subject" refers to human subjects as well as non-human subjects such as mammals, invertebrates, vertebrates, fungi, yeast, bacteria, viruses, and plants. Although the examples in this specification concern human subjects, the concept of the present invention can be applied to genomes derived from any non-human organism, such as animals or plants, and is useful in fields such as medicine, veterinary medicine, and animal science.

In this specification, "diagnose" or "diagnosis" refers to the identification by a physician of a disease, disorder, or condition based on a patient's signs and symptoms.

In this specification, "test" refers to an examination of a subject, including a human, for the presence or absence of a predisposition to a disease or of its onset, which does not require identification (diagnosis) by a physician.

[Mutation determination method and mutation determination program]
The mutation determination method according to one embodiment of the present invention is a method performed by a computer to determine the presence or absence of a mutation in a target gene, and can be performed by, for example, the computer 10 shown in FIG. 1.

As shown in FIG. 1, the computer 10 includes a processing unit 11, a storage unit 12, and a communication unit 13. The mutation determination method according to this embodiment is carried out when the processing unit 11 executes the mutation determination program 100 stored in the storage unit 12. The mutation determination program 100 includes an instruction group 101 for carrying out each step of the mutation determination method, a reference sequence 102, minimum value information 103, and statistical model information 110. The statistical model information 110 includes a statistical model type 111 and normal sample data 112.

The computer 10 is connected to a sequencer 20 and receives sequencing results from it. In one embodiment, the computer 10 analyzes sequencing data produced with the Ion PGM (registered trademark), and the mutation determination method according to this embodiment is executed on a run for which analysis of the Ion PGM sequencing data has been completed. The communication unit 13 communicates with an external server 30 via a network.

<Mutations to be determined>
The types of mutation that can be determined by the mutation determination method according to this embodiment are InDels and SNVs. Accordingly, the polynucleotide sequence of a sample may have an InDel, an SNV, or a combination of these. That is, the polynucleotide sequence of the sample may be a sequence having any one of an insertion mutation, a deletion mutation, and a single-nucleotide substitution mutation, or a sequence having two or more of these types at the same time. When two or more types of mutation are present, for example both an SNV and an InDel, an independent detection flow may be executed for each mutation so that each is detected individually. Sequences having both an InDel and an SNV, and sequences in which a deletion occurred after an insertion, are also subject to determination in this embodiment. Single-nucleotide substitution mutations include silent mutations. One example of a mutation type targeted by the method is an InDel, and a further example is a deletion. In this embodiment, any desired number of sequences having desired sequences can be set as reference sequences, so mismatches in mapping can be reduced and InDel mutations can be detected more suitably than with conventionally known techniques.

The length of the deleted region in a deletion mutation to be determined by the mutation determination method according to this embodiment is not particularly limited and may be any length; one example is 5 bp to 30 bp, and another is 10 bp to 30 bp. If the deleted region is 5 bp or longer, or 10 bp or longer, the difference from the wild-type reference sequence is sufficiently large and the determination can be made with higher accuracy.

The lower limit of the length of the inserted region in an insertion mutation to be determined by the mutation determination method according to this embodiment is, for example, 5 bp or more, or 10 bp or more. If the inserted region is 5 bp or longer, or 10 bp or longer, the difference from the wild-type reference sequence is sufficiently large and the determination can be made with higher accuracy. The upper limit must be sufficiently shorter than the length of a single sequence read.

<Sequencing technology and sequencing apparatus>
The mutation determination method according to this embodiment preferably targets sequencing data obtained by sequencing with next-generation sequencing technology. That is, a next-generation sequencer is preferably used as the sequencer 20. Next-generation sequencers are a group of nucleotide sequence analyzers that have been under development in recent years; they achieve dramatically improved analytical capacity by processing clonally amplified DNA templates or single DNA molecules in massively parallel fashion in a flow cell. A specific example is the Ion PGM (registered trademark) from Life Technologies, but the sequencer is not limited to this and includes devices developed in the future.

The sequencing technology usable in this embodiment may be one that obtains multiple reads by reading the same region repeatedly (deep sequencing).

Examples of sequencing technologies usable in this embodiment include technologies that are based on sequencing principles other than the Sanger method and can obtain a large number of reads per run, such as ion semiconductor sequencing, pyrosequencing, sequencing-by-synthesis using reversible dye terminators, sequencing-by-ligation, and sequencing by oligonucleotide probe ligation.

The sequencing primers used for sequencing are not particularly limited and are set as appropriate based on sequences suitable for amplifying the target region. Reagents used for sequencing may likewise be selected to suit the sequencing technology and sequencing apparatus used.

In one embodiment and in the Examples below, amplicon sequencing is performed using the Ion PGM (registered trademark) from Life Technologies. In one example, the sequencing primers for detecting mutations in the EGFR gene are set in the regions of the exons containing the mutations (exons 18, 19, 20, and 21).

The sequencing technology according to the present invention also includes the multiplex method, in which sequences derived from multiple samples are sequenced simultaneously in one run. In the multiplex method, a "barcode" sequence unique to each sample is added so that the sequence data can be identified per sample during data analysis. With the multiplex method, the time to data acquisition can be shortened dramatically even when analyzing a large number of samples.
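
As an illustration of how barcodes allow per-sample separation at analysis time, the following minimal sketch groups reads by a leading barcode. It assumes, purely for illustration, that the barcode sits at the start of each read and must match exactly; the barcode strings and sample names are hypothetical, and real instrument software performs this step in its own way.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group reads by sample using an exact-match barcode at the start of each read.

    Reads with no recognised barcode are discarded. The barcode itself is
    trimmed from the returned read.
    """
    by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcode_to_sample.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
    return dict(by_sample)


# Hypothetical barcodes and reads.
barcodes = {"AAC": "sample_1", "GGT": "sample_2"}
reads = ["AACTTTT", "GGTCCCC", "AACGGGG", "TTTAAAA"]
grouped = demultiplex(reads, barcodes)
```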

<Sequence data>
All read sequences obtained by sequencing with the sequencer 20 described above are used as the sequencing data in this embodiment. The sequencing may generate, for each region to be determined, preferably 1,000 or more reads, and more preferably 100,000 or more reads. Detection sensitivity increases with the number of reads. The read length may be whatever the platform used provides. For example, the read length need only be compatible with the mapping software used; it may be 100 to 200 bp, and paired-end reads may be included. The sequence data are mapped by alignment against the reference sequences, and the mutation determination process is then performed.

The sequence data according to the present invention may be data obtained in, for example, FASTA format or FASTQ format.

<Determination of reference sequences>
The reference sequence 102 in this embodiment is determined from a sequence that includes at least the region in which the target mutation exists. One example of a reference sequence is the sequence of a specific region, based on the sequence of interest, that includes at least the mutation. In particular, when the target genomic sequence contains a mutation in an exon region, the reference sequence may be restricted to that exon region. The reference sequence used to detect SNVs is the wild-type sequence of the target gene (the specific reference sequence); the reference sequences used to detect InDels are the wild-type sequence of the target gene (the specific reference sequence) and sequences containing known InDel mutations (detection reference sequences). The specific reference sequence used to detect SNVs may be the full-length wild-type genomic sequence of the target gene.

The overall length of a reference sequence is not particularly limited, and a sequence of any desired length can be used according to the purpose. Note, however, that the time required to complete the analysis increases with the length of the reference sequence.

In the mutation determination method according to this embodiment, the number of reference sequences used for one determination is not particularly limited and is set according to the number of mutations to be detected.

In one embodiment, a specific reference sequence may be prepared for each exon containing a mutation to be determined.

As an example of such specific reference sequences, when the mutation determination method determines the presence or absence of mutations in the EGFR gene, the specific reference sequences may include the sequence of wild-type EGFR exon 18 shown in SEQ ID NO: 1, the sequence of wild-type EGFR exon 19 shown in SEQ ID NO: 2, the sequence of wild-type EGFR exon 20 shown in SEQ ID NO: 3, and the sequence of wild-type EGFR exon 21 shown in SEQ ID NO: 4.

In one embodiment, the detection reference sequences may be prepared in advance by selecting only high-frequency variants based on mutation information obtained from public databases.

As an example of such detection reference sequences, when the mutation determination method determines the presence or absence of mutations in the EGFR gene, the detection reference sequences may include those shown in SEQ ID NOs: 5 to 11, 13, 15 to 18, 22, 28, 31, 33, 35, and 39. These detection reference sequences are obtained by applying high-frequency InDel mutations taken from the COSMIC database to the specific reference sequence, that is, the wild-type EGFR exon 19 sequence shown in SEQ ID NO: 2.

In another embodiment, the detection reference sequences are prepared in advance for almost all variants obtained based on mutation information from the latest public databases. Setting a large number of reference sequences corresponding to the target mutations in this way prevents reads from being mapped to the wrong reference and detected as false positives. As a result, the nucleic acid sequences contained in the sample can be mapped to the base sequence of the more appropriate variant.

As an example of such detection reference sequences, when the mutation determination method determines the presence or absence of mutations in the EGFR gene, the detection reference sequences may include those shown in SEQ ID NOs: 5 to 13, 15 to 18, 20 to 33, and 35 to 101. These detection reference sequences are obtained by applying almost all InDel mutations taken from the COSMIC database to the specific reference sequence, that is, the wild-type EGFR exon 19 sequence shown in SEQ ID NO: 2.

Owing to the characteristics of the sequencer used, specific read errors (for example, GG being misread as G) may occur with high probability during sequencing. When the region to be mapped contains a sequence in which such a read error can occur, detection reference sequences for handling that read error may additionally be prepared.

A detection reference sequence for handling a read error is a sequence in which at least one such read error has occurred relative to the original sequence, yielding an erroneous sequence. Such a read error may be one known to occur with high probability on the sequencer used, regardless of whether it takes the form of an insertion, a deletion, or a single-nucleotide substitution. The detection reference sequences for handling read errors need not reflect every possible read error; they may reflect only some of them.

Preparing such detection reference sequences prevents a read that should be judged positive for a mutation from failing to map to the detection reference for that mutation because the read contains a read error. The determination can therefore be made with higher accuracy.

As an example of such detection reference sequences, when the mutation determination method determines the presence or absence of mutations in the EGFR gene, the detection reference sequences may include (i) the detection reference sequences shown in SEQ ID NOs: 5 to 13, 15 to 18, 20 to 33, and 35 to 101, and in addition (ii) the detection reference sequences for handling read errors shown in SEQ ID NOs: 14, 19, and 34, which reflect the read error in which GG is misread as G in the detection reference sequences shown in SEQ ID NOs: 13, 18, and 33. In this case, the reference sequences comprise those shown in SEQ ID NOs: 1 to 101.
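
The idea of deriving a read-error reference from an existing detection reference can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the toy sequences are not the SEQ ID NO sequences, and the sketch enumerates every single GG-to-G collapse, whereas the specification notes that only some read errors need to be reflected.

```python
from typing import List


def collapse_variants(reference: str, dimer: str = "GG") -> List[str]:
    """Return the distinct sequences obtained by misreading one `dimer` as a single base."""
    variants: List[str] = []
    start = reference.find(dimer)
    while start != -1:
        # Replace this occurrence of the dimer with one copy of its base.
        variant = reference[:start] + dimer[0] + reference[start + len(dimer):]
        if variant not in variants:
            variants.append(variant)
        start = reference.find(dimer, start + 1)
    return variants


# Toy sequence, for illustration only.
error_references = collapse_variants("GGAGG")
```

Overlapping occurrences (for example in a run of three G bases) yield the same collapsed sequence, so the duplicate check keeps the output free of repeated references.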

In one embodiment, the sequence information of the reference sequences is prepared in advance from wild-type sequences obtained from public sequence information databases, based on mutation information obtained from public databases of known mutations.

Public sequence information databases include NCBI RefSeq (web page, http://www.ncbi.nlm.nih.gov/refseq/), NCBI GenBank (web page, http://www.ncbi.nlm.nih.gov/genbank/), and the UCSC Genome Browser (web page, https://genome.ucsc.edu/). Public known-mutation databases include the COSMIC database (web page, http://www.sanger.ac.uk/genetics/CGP/cosmic/) and dbSNP (web page, http://www.ncbi.nlm.nih.gov/SNP/). The reference sequences may also be drawn from other commercial databases or from a database newly created from published literature, experimental data, and the like, and several of these databases may be used in combination. The number of reference sequences, the length of each sequence, and so on are set as appropriate for the purpose, using the sequences and the detailed sequence information obtained from these databases. For example, the reference sequences may additionally be set taking into account frequency information on publicly known mutations for each ethnic group or animal species. Public known-mutation databases holding such information include HapMap Genome Browser release #28 (web page, http://hapmap.ncbi.nlm.nih.gov/cgi-perl/gbrowse/hapmap28_B36/), Human Genetic Variation Browser (web page, http://www.genome.med.kyoto-u.ac.jp/SnpDB/index.html), and 1000 Genomes (web page, http://www.1000genomes.org/); from these databases, for example, mutation frequency information for the Japanese population can be obtained.

In another embodiment, the reference sequences may be created, as part of the mutation determination method according to this embodiment, based on mutation information obtained from the latest public sequence information database. That is, in one embodiment, the method may include a step in which the communication unit 13 acquires mutation information from an external server 30, such as a public known-mutation database, a detection reference sequence (template sequence) is created by applying that mutation to the specific reference sequence (wild-type sequence) stored in the storage unit 12, and the result is stored in the storage unit 12.

To give an example, in one embodiment in which the mutation determination method determines the presence or absence of mutations in the EGFR gene, when the communication unit 13 acquires from the COSMIC database, as a mutation not yet acquired, the mutation "c.2235_2249del15" in exon 19 of the EGFR gene, the communication unit 13 creates a new detection reference sequence (SEQ ID NO: 8) by applying "c.2235_2249del15" to the wild-type exon 19 sequence (SEQ ID NO: 2) stored in the storage unit 12, and stores it in the storage unit 12.
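
Applying a deletion such as "c.2235_2249del15" to a stored wild-type sequence can be sketched as below. This is a minimal illustration, not the patented implementation: the function name is hypothetical, only the simple `c.START_ENDdelN` form is parsed, and `first_cds_pos` (the coding-sequence position of the first base of the stored sequence) is an assumed input that the specification does not describe. The toy sequence is not SEQ ID NO: 2.

```python
import re


def apply_deletion(wild_type: str, notation: str, first_cds_pos: int) -> str:
    """Delete the bases named by a 'c.START_ENDdelN' notation from `wild_type`.

    START and END are 1-based, inclusive coding-sequence positions;
    `first_cds_pos` is the coding-sequence position of wild_type[0].
    """
    match = re.fullmatch(r"c\.(\d+)_(\d+)del(\d*)", notation)
    if match is None:
        raise ValueError(f"unsupported notation: {notation}")
    start, end = int(match.group(1)), int(match.group(2))
    if match.group(3) and int(match.group(3)) != end - start + 1:
        raise ValueError("stated deletion length does not match the coordinates")
    lo = start - first_cds_pos      # 0-based index of the first deleted base
    hi = end - first_cds_pos + 1    # 0-based index just past the last deleted base
    if lo < 0 or hi > len(wild_type):
        raise ValueError("deletion lies outside the stored sequence")
    return wild_type[:lo] + wild_type[hi:]


# Toy wild-type sequence whose first base is assumed to be at c.2230.
toy_wild_type = "AAAAA" + "C" * 15 + "TTTTT"
toy_reference = apply_deletion(toy_wild_type, "c.2235_2249del15", first_cds_pos=2230)
```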

With this configuration, the nucleic acid sequences contained in the sample can be mapped to the latest variants, so mismatches and erroneous determinations can be reduced and more accurate mapping becomes possible.

The reference sequences are also set so that no two or more of them carry the same mutation (eliminating the possibility of multiple hits). In other words, they are chosen so that exactly one reference sequence corresponds to each mutation. For example, when entries are registered in a database as different specimens under different IDs but carry the same mutation, all but one are removed.
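
One simple way to enforce the one-reference-per-mutation rule is to keep only the first entry for each distinct sequence. The sketch below assumes, for illustration, that "the same mutation" can be recognised as an identical resulting reference sequence; the database IDs and sequences shown are hypothetical.

```python
def deduplicate_references(entries):
    """Keep the first (database_id, sequence) pair for each distinct sequence."""
    seen = set()
    unique = []
    for database_id, sequence in entries:
        if sequence in seen:
            continue  # same mutation already represented by an earlier entry
        seen.add(sequence)
        unique.append((database_id, sequence))
    return unique


# Hypothetical entries: two IDs describe the same deletion.
entries = [
    ("ID_0001", "ACGTACGGCCAA"),
    ("ID_0002", "ACGTACGGCCAA"),
    ("ID_0003", "ACGTTTGGCCAA"),
]
unique_references = deduplicate_references(entries)
```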

<Data analysis flow>
An example of the flow of the mutation determination method (the data analysis flow) according to one embodiment of the present invention is described in detail below. The description uses the EGFR gene as the target gene and covers a flow for detecting predetermined mutations in it, but the present invention is not limited to this.

(Step 1. Input of sequence read data)
FIG. 2 shows an example of the data analysis flow according to one embodiment of the present invention. First, the sequence read data to be used for determination are input to the processing unit 11 (step S201). A sequence result file output after running a known sequence analysis program can be used as the input data for the analysis of the present invention. That is, the processing unit 11 obtains the sequence read data by reading that sequence result file.

In one embodiment, the sequence result file output from the Ion PGM (file name: Unaligned BAM) is used. Because this file holds the data before the usual Ion PGM mapping is performed, it can be used as input data for running the software of the present invention.

(Step 2. Removal of reads)
As shown in step S202, before the analysis the processing unit 11 removes read data unsuitable for use in the analysis, for example reads that are too short. For example, the processing unit 11 removes, from the reads in the sequence read data input in step S201, reads of 70 bp or less. The embodiment is not limited to this; as other examples, the processing unit 11 may remove reads of 80 bp or less, or reads of 60 bp or less. Removing reads that are too short eliminates reads that are likely to contain read errors and the like, which allows a more accurate analysis.
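
The length filter of step S202 can be sketched in a few lines. The function and constant names are hypothetical; reads are represented as plain strings for illustration, whereas the actual input in the embodiment is an unaligned BAM file.

```python
LENGTH_CUTOFF_BP = 70  # reads of this length or shorter are removed


def remove_short_reads(reads, cutoff=LENGTH_CUTOFF_BP):
    """Return only the reads strictly longer than `cutoff` bases."""
    return [read for read in reads if len(read) > cutoff]


# Toy reads of 10, 70 and 71 bases.
toy_reads = ["A" * 10, "A" * 70, "A" * 71]
kept = remove_short_reads(toy_reads)
```

Passing `cutoff=80` or `cutoff=60` gives the alternative thresholds mentioned in the text.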

(Step 3. SNV detection)
Next, as shown in step S203, the processing unit 11 performs SNV detection. FIG. 3 shows the detailed flow of SNV detection according to one embodiment of the present invention.

As shown in FIG. 3, in the SNV detection flow the processing unit 11 first maps the sequence reads to the reference sequences stored in the storage unit 12 (the sequences of wild-type EGFR exons 18, 19, and 20) and extracts the corresponding reads and mutant reads that correspond to specific regions (exon 18, exon 20, and exon 21) of the target gene (the EGFR gene) (step S303, extraction step).

Mapping can be performed using known methods and computer programs for mapping read sequences to reference sequences. In this specification, "mapping a read sequence to a reference sequence" means comparing the read sequence with each reference sequence and identifying the reference sequence that is most plausible as the one corresponding to the read (for example, the one with the highest mapping score). In this embodiment, mapping is performed with the bwasw command of BWA 0.6.2 on the apparatus that runs the usual Ion PGM sequencing result analysis program.
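
The notion of picking the most plausible reference by score can be illustrated with a toy example. The sketch below deliberately substitutes a generic string-similarity ratio from the Python standard library for the mapping score; it is not BWA, it does not reproduce the scoring used by bwasw, and the function names and sequences are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(read: str, reference: str) -> float:
    """Stand-in score in [0, 1]; NOT the mapping score computed by BWA."""
    return SequenceMatcher(None, read, reference, autojunk=False).ratio()


def best_reference(read: str, references: dict):
    """Return (name, score) of the reference with the highest stand-in score."""
    scores = {name: similarity(read, seq) for name, seq in references.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Toy references: a wild-type sequence and the same sequence with 4 bases deleted.
toy_references = {
    "wild_type": "ACGTACGTTTGGCCAA",
    "deletion": "ACGTACGGCCAA",
}
best_name, best_score = best_reference("ACGTACGGCCAA", toy_references)
```

A read that carries the deletion scores highest against the deletion reference, which is the behaviour the multiple-reference design relies on.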

Many known software packages and computer algorithms for aligning sequences are available for mapping. Such software includes, without limitation, BWA, Bowtie2, and BFAST. Usable algorithms include, without limitation, the Burrows-Wheeler Transform (BWT), read hash, genome hash, merge sort, and Smith-Waterman.

The processing unit 11 extracts, as corresponding reads, the reads mapped to the reference sequences of the specific regions (exon 18, exon 20, and exon 21) of the target gene (the EGFR gene), and compares each reference sequence with the sequence of each corresponding read to detect mutations in that read. The storage unit 12 stores mutation information defining the specific mutations to be detected in this embodiment. By comparing the mutations found between a reference sequence and a corresponding read mapped to it with the mutations defined in the mutation information, the processing unit 11 can determine whether the corresponding read contains a specific mutation to be detected in this embodiment. In this way, the processing unit 11 can extract the corresponding reads and the mutant reads.
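
The comparison described above can be sketched for the simplest case. The sketch assumes, for illustration only, that a corresponding read aligns to the reference without gaps starting at a known offset and lies entirely within the reference; the function names, the tuple representation of a mutation, and the toy sequences are all hypothetical.

```python
def find_substitutions(reference: str, read: str, offset: int = 0):
    """List (1-based reference position, reference base, read base) for each mismatch.

    Assumes a gapless alignment of `read` starting at reference[offset].
    """
    return [
        (offset + i + 1, reference[offset + i], base)
        for i, base in enumerate(read)
        if reference[offset + i] != base
    ]


def is_mutant_read(reference: str, read: str, target_mutations, offset: int = 0) -> bool:
    """True if the read carries at least one mutation listed in `target_mutations`."""
    found = find_substitutions(reference, read, offset)
    return any(mutation in target_mutations for mutation in found)


# Toy reference and stored mutation information (position 4, T to A).
toy_reference = "ACGTACGT"
toy_targets = {(4, "T", "A")}
mutant = is_mutant_read(toy_reference, "ACGAAC", toy_targets)
```

A mismatch that is not listed in the stored mutation information leaves the read counted as a corresponding read but not as a mutant read.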

Subsequently, in step S304, the processing unit 11 removes unreliable and ambiguous mapping results. Specifically, from the corresponding reads (and mutant reads) extracted in step S303, it removes those that were mapped to more than one reference sequence. This improves the reliability of the determination result. In this embodiment, in the mapping result obtained with BWA, reads that hit multiple references are annotated in the BAM (sequencing result file) as (mappingQuality=0) and (QC fail). The processing unit 11 may refer to this information.
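The filtering in step S304 can be sketched as follows. This is only a minimal illustration: the `MappedRead` record and its field names are hypothetical stand-ins for the annotations stored in the BAM, not part of any real BAM library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MappedRead:
    # Hypothetical record; field names are assumptions for illustration.
    name: str
    reference: str        # reference sequence the read was mapped to
    mapping_quality: int  # 0 for reads that hit several references
    qc_fail: bool         # QC-fail annotation recorded with the alignment

def drop_ambiguous(reads):
    """Remove reads annotated as multi-hit (mapping quality 0 and QC fail)."""
    return [r for r in reads if not (r.mapping_quality == 0 and r.qc_fail)]

reads = [
    MappedRead("r1", "EGFR_exon21", 37, False),
    MappedRead("r2", "EGFR_exon21", 0, True),   # hit several references
    MappedRead("r3", "EGFR_exon18", 12, False),
]
kept = drop_ambiguous(reads)
print([r.name for r in kept])
```

Whether a read carrying only one of the two annotations should also be dropped is not stated in the text; the sketch removes only reads carrying both.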

Subsequently, in step S305, the processing unit 11 sets the variable R to the minimum number of reads required for mutation determination. This minimum can be defined in advance as, for example, 1,000 or 10,000.

Subsequently, the processing unit 11 repeats the steps between step S306 and step S314 while changing the mutation to be determined in turn (mutation 1, mutation 2, ..., mutation N).

First, in step S307, the processing unit 11 assigns to the variable X the normal sample data measured in advance for each mutation, to the variable M the distribution model defined in advance for each mutation, to the variable D the total number of reads mapped to the target mutation site (the number of corresponding reads), and to the variable A the number of mutant reads mapped to the target mutation site.

Next, in step S308, the processing unit 11 compares the variable D with the variable R. If D < R (Yes), that is, if the total number of reads mapped to the target mutation site (the number of corresponding reads) is below the minimum number of reads required for mutation determination, the result is set to not detected (N/D) in step S313 and processing moves on to the next mutation. If D < R does not hold (No), processing proceeds to step S309.

In step S309, the processing unit 11 executes the threshold determination flow using the data assigned to the variable X as the "normal sample data for the target mutation", the data assigned to the variable M as the "distribution model of the target mutation", and the value assigned to the variable D as the "observed total number of reads", and calculates a threshold for determining the presence or absence of the mutation at a predetermined significance level (calculation step). The threshold determination flow is described in detail later.

Subsequently, in step S310, the processing unit 11 compares the variable A with the threshold calculated in step S309. If A ≥ threshold (Yes), the result is determined to be positive (step S311, determination step); otherwise (No), it is determined to be negative (step S312).

The above is performed for each mutation to be determined.
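The loop of steps S305 to S314 can be summarized in a short sketch. The function `threshold_for` is a hypothetical placeholder for the threshold determination flow of step S309, and the toy threshold rule and read counts below are invented purely for illustration.

```python
from typing import Callable, Dict, Tuple

def judge_mutations(
    counts: Dict[str, Tuple[int, int]],        # mutation -> (D, A)
    threshold_for: Callable[[str, int], int],  # hypothetical stand-in for step S309
    min_reads: int = 1000,                     # R, set in step S305
) -> Dict[str, str]:
    results = {}
    for mutation, (depth, mutant_reads) in counts.items():
        if depth < min_reads:                   # step S308: D < R
            results[mutation] = "N/D"           # step S313
        elif mutant_reads >= threshold_for(mutation, depth):  # step S310
            results[mutation] = "positive"      # step S311
        else:
            results[mutation] = "negative"      # step S312
    return results

def toy_threshold(mutation: str, depth: int) -> int:
    # Illustrative rule only; the real threshold comes from the fitted model.
    return max(3, depth // 1000)

calls = judge_mutations(
    {"mutation 1": (25000, 40), "mutation 2": (800, 5), "mutation 3": (12000, 4)},
    toy_threshold,
)
print(calls)
```

Passing the observed depth into `threshold_for` reflects the point made in the text that the threshold is computed per mutation from the actual read count rather than fixed in advance.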

For a wild-type sequence with no mutation, the result may simply be determined as wild type. When several different mutation types are detected simultaneously as mapping candidates in the same sample, the mutations may be detected by running the detection flow described above independently for each mutation type.

(Step 4. InDel detection process)
Subsequently, as shown in step S204, the processing unit 11 performs the InDel detection process. FIG. 4 shows the detailed flow of the InDel detection process according to an embodiment of the present invention.

As shown in FIG. 4, in the InDel detection flow, the processing unit 11 first maps the sequence reads to the reference sequences stored in the storage unit 12 (the wild-type EGFR exon 19 sequence and EGFR exon 19 sequences reflecting the InDels to be detected) and extracts the corresponding reads and mutant reads (step S402, extraction step).

The processing unit 11 extracts, as corresponding reads, the reads mapped to a reference sequence of the specific region (exon 19, wild type or with an InDel) of the target gene (the EGFR gene), and extracts, as mutant reads, the corresponding reads mapped to a reference sequence having a specific mutation (InDel).

Subsequently, in step S403, the processing unit 11 removes unreliable and ambiguous mapping results. Specifically, from the corresponding reads (and mutant reads) extracted in step S402, it removes those that were mapped to more than one reference sequence. This improves the reliability of the determination result.

In step S404, if 10% or more of the reads mapped to a reference sequence having an InDel (template sequence) have a mismatch with respect to the template, the processing unit 11 removes the reads having that mismatch (mismatched reads) from the corresponding reads and mutant reads. In the Ion PGM sequencing-result analysis program, reads with a mismatch may, for example, be annotated in the BAM as QC fail, and the processing unit 11 may refer to this. The embodiment is not limited to the 10% criterion; as another example, the processing unit 11 may remove the mismatched reads when the proportion of reads having the mismatch is at or above a value in the range of 5 to 20%. This improves the reliability of the determination result.

Mismatched-read removal may also be iterated: for example, after mismatched reads have been removed at the 10% criterion, the same 10% criterion is applied again to the remaining data, and this is repeated until no mismatched reads remain. Repeating the removal in this way eliminates as many mismatched reads as possible.
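One possible reading of the repeated removal is sketched below. It assumes that the 10% criterion is evaluated per distinct mismatch (for example, a position and base change) among the reads that currently remain; the text does not spell this out, so both the interpretation and the data layout are assumptions.

```python
from collections import Counter

def remove_recurrent_mismatch_reads(reads, min_fraction=0.10):
    """reads: list of (name, frozenset of mismatch labels) for one template.

    Repeatedly drops reads carrying any mismatch shared by at least
    `min_fraction` of the remaining reads, until no such mismatch is left.
    """
    remaining = list(reads)
    while remaining:
        counts = Counter(m for _, mismatches in remaining for m in mismatches)
        recurrent = {m for m, c in counts.items()
                     if c / len(remaining) >= min_fraction}
        if not recurrent:
            break
        remaining = [r for r in remaining if not (r[1] & recurrent)]
    return remaining

# Toy data: 8 clean reads, 3 reads sharing mismatch "x", 1 read with "y".
reads = ([(f"clean{i}", frozenset()) for i in range(8)]
         + [(f"x{i}", frozenset({"x"})) for i in range(3)]
         + [("y0", frozenset({"y"}))])
survivors = remove_recurrent_mismatch_reads(reads)
print(len(survivors))
```

In this toy data, "y" is below 10% on the first pass (1 of 12 reads) but reaches the criterion once the "x" reads are gone (1 of 9), which is the situation the repeated pass is meant to catch.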

A mismatch here means the presence of one or more differences between the read sequence and the reference sequence. Such a difference is a substitution of at least one base, a deletion and/or insertion of at least one base, or a combination of these. When mismatches occur at or above the proportion stated above, these reads can cause false positives. For example, reads containing a novel mutation that is not present in the reference sequence can be removed as mismatches.

Subsequently, in step S405, the processing unit 11 sets the variable R to the minimum number of reads required for mutation determination.

Subsequently, the processing unit 11 repeats the steps between step S406 and step S414 while changing the mutation to be determined in turn (mutation 1, mutation 2, ..., mutation N).

First, in step S407, the processing unit 11 assigns to the variable X the normal sample data measured in advance for each mutation, to the variable M the distribution model defined in advance for each mutation, to the variable D the total number of reads mapped to the target mutation site (the number of corresponding reads), and to the variable A the number of mutant reads mapped to the target mutation site.

Next, in step S408, the processing unit 11 compares the variable D with the variable R. If D < R (Yes), that is, if the total number of reads mapped to the target mutation site (the number of corresponding reads) is below the minimum number of reads required for mutation determination, the result is set to not detected (N/D) in step S413 and processing moves on to the next mutation. If D < R does not hold (No), processing proceeds to step S409.

In step S409, the processing unit 11 executes the threshold determination flow using the data assigned to the variable X as the "normal sample data for the target mutation", the data assigned to the variable M as the "distribution model of the target mutation", and the value assigned to the variable D as the "observed total number of reads", and calculates a threshold for determining the presence or absence of the mutation at a predetermined significance level (calculation step). The threshold determination flow is described in detail later.

Subsequently, in step S410, the processing unit 11 compares the variable A with the threshold calculated in step S409. If A ≥ threshold (Yes), the result is determined to be positive (step S411, determination step); otherwise (No), it is determined to be negative (step S412).

The above is performed for each mutation to be determined. In this embodiment, the InDel mutation to be determined is a specific type of InDel mutation (that is, a mutation consisting of at least one of an insertion of predetermined bases and a deletion of predetermined bases). The detection reference sequences include a plurality of sequences corresponding to mutations that differ from one another, and in step S402 the reads mapped to the detection reference sequence corresponding to the specific type of InDel mutation under determination are extracted as mutant reads. The presence or absence of a specific type of InDel mutation can thus be determined reliably. Compared with the prior art, which could only determine whether some InDel mutation was present, InDel mutations can be detected type by type, so more detailed mutation information is obtained. In addition, because the threshold for determining the presence or absence of an InDel mutation is calculated for each type, the determination is more accurate.

In one embodiment, in parallel with the InDel detection process described above, all corresponding reads mapped to any of the detection references may be extracted as mutant reads in step S402, as in the prior art, in order to determine whether some InDel mutation is present. In this way, the presence of an InDel mutation can be detected even when the mutation present does not correspond to the detection reference for the type under determination.

In one embodiment, when the detection reference sequences include the detection reference sequences for handling read errors described above, both of the following may be extracted in step S402 as mutant reads for the mutation under determination: reads mapped to the detection reference sequence that corresponds to the mutation and contains no read error, and reads mapped to a read-error detection reference sequence, that is, a version of that detection reference sequence in which a read error is reflected. Reads that contain a read error and therefore were not mapped to the detection reference for the mutation are instead mapped to the read-error detection reference sequence and can still be extracted as mutant reads for that mutation. This prevents misjudgment and allows more accurate determination.
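The type-by-type counting described above, including the pooling of read-error references, can be sketched as a lookup from reference name to InDel type. All reference names and InDel labels below are hypothetical.

```python
from collections import Counter

# Hypothetical reference names. Each detection reference, including the
# read-error variants, is assigned to one InDel type; None marks wild type.
REFERENCE_TO_MUTATION = {
    "exon19_wt": None,
    "exon19_delA": "delA",
    "exon19_delA_readerr1": "delA",   # read-error version of the delA reference
    "exon19_delB": "delB",
}

def count_reads(mapped_references):
    """Return (D, mutant read count per InDel type) for exon 19."""
    depth = 0
    mutant = Counter()
    for ref in mapped_references:
        if ref not in REFERENCE_TO_MUTATION:
            continue                  # read mapped outside the target region
        depth += 1                    # every corresponding read counts toward D
        mutation = REFERENCE_TO_MUTATION[ref]
        if mutation is not None:
            mutant[mutation] += 1     # counts toward A for that InDel type
    return depth, mutant

refs = (["exon19_wt"] * 5 + ["exon19_delA"] * 2
        + ["exon19_delA_readerr1"] + ["exon19_delB"])
depth, mutant = count_reads(refs)
print(depth, dict(mutant))
```

Summing all values of `mutant` would give the "any InDel" count used in the parallel, prior-art style determination.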

Finally, the mutation determination results are output. For example, the processing unit 11 may output the results in HTML format to a Web browser, or it may output them as a text-format report.

<Distribution model selection workflow>
Before describing the threshold determination flow of steps S309 and S409, the distribution model selection workflow, which selects for each mutation the distribution model used to calculate the threshold, is described. In one embodiment, this workflow is carried out in advance and its result is stored in the statistical model information 110 of the mutation determination program 100.

The statistical model used for threshold setting is one created in advance from normal sample data. A normal sample here is a sample that has been verified to carry no mutation, and normal sample data are the pairs of the number of corresponding reads (total number of reads) and the number of mutant reads obtained by running the mutation determination method of this embodiment, up to the extraction step (steps S303 and S402), on a normal sample as the sample under determination. The normal sample data include data obtained from a plurality of normal samples. In this embodiment, a statistical model that can predict background noise is created in advance from these normal sample data. Specifically, the distribution model used for threshold determination is selected according to the flow shown in FIG. 5.

The procedure for creating statistical models and selecting an appropriate one, and the criteria for that selection, are described below. The description covers a mode in which the computer 10 supports the operator's judgment and the operator makes the final decision.

For the target mutation, the processing unit 11 obtains the pairs of total read count and mutant read count from the normal sample data (step S501). Next, to support an overall judgment, the processing unit 11 displays the relationship between the total read count and the mutant read count in the normal sample data as a scatter plot and a histogram on a display means (not shown) (step S502). The operator then visually checks the shape of the distribution in these plots and uses it as input for assessing the validity of each model (step S503).

Next, to further confirm the validity of the distribution models, the parameters of each model are estimated and their validity is examined. For example, the processing unit 11 first applies the Poisson distribution as the distribution model and fits the total read counts and mutant read counts of the normal sample data to a generalized linear model (GLM) with that distribution. For the fitted model, the processing unit 11 then calculates the P-P plot, the Kolmogorov-Smirnov (K-S) test result, the deviance and the AIC, and displays them on a display unit (not shown). The operator visually checks the displayed data and uses it as input for assessing validity (step S504). Next, the processing unit 11 applies the negative binomial distribution as the distribution model and fits the normal sample data in the same way as for the Poisson distribution. As in step S504, the P-P plot, K-S test result, deviance and AIC are then displayed and visually checked by the operator (step S505).

Next, for each model, it is checked whether the data are sparse (step S506). Sparse data here means data in which the RER values are almost all 0 with only a few small nonzero values, or are all 0.

If the data are judged to be sparse in step S506 (Yes), the "minimum value" model is used (step S507). Specifically, information indicating that the minimum threshold value is to be used is stored, in association with the mutation, in the minimum value information 103 of the mutation determination program 100.

If the data are judged not to be sparse in step S506 (No), the operator selects a suitable statistical model (step S508). The selection is an overall judgment based on points such as the following: (i) the closer the P-P plot is to a straight line, the better the fit; (ii) the larger the p-value of the K-S test, the better the fit; (iii) the lower the AIC, the better the prediction; and (iv) a ratio of residual deviance to degrees of freedom far above 1 indicates overdispersion.
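Criteria (iii) and (iv) can be illustrated with a deliberately simplified Poisson fit. The sketch assumes an intercept-only model in which the expected mutant read count is a single error rate times the total read count; the GLM actually fitted in steps S504 and S505 may be parameterized differently, and the negative binomial fit is not shown.

```python
import math

def poisson_rate_fit(samples):
    """samples: list of (total_reads D, mutant_reads A) from normal samples.

    Fits lambda_i = rate * D_i by maximum likelihood and returns the rate,
    the AIC, and residual deviance / degrees of freedom.
    """
    total_d = sum(d for d, _ in samples)
    total_a = sum(a for _, a in samples)
    if total_a == 0:
        raise ValueError("all-zero data: treat as sparse (minimum value model)")
    rate = total_a / total_d               # MLE for the intercept-only model
    loglik = 0.0
    deviance = 0.0
    for d, a in samples:
        lam = rate * d
        loglik += a * math.log(lam) - lam - math.lgamma(a + 1)
        if a > 0:
            deviance += 2.0 * (a * math.log(a / lam) - (a - lam))
        else:
            deviance += 2.0 * lam          # limit of the deviance term as a -> 0
    dof = len(samples) - 1                 # one fitted parameter
    return {"rate": rate, "aic": 2 * 1 - 2 * loglik, "dispersion": deviance / dof}

fit = poisson_rate_fit([(10000, 0), (10000, 20), (10000, 1), (10000, 19)])
print(fit["dispersion"] > 1)  # far above 1 suggests overdispersion
```

When the dispersion ratio is far above 1, as in this toy data, the text's criterion (iv) points toward a model that allows extra variance, such as the negative binomial distribution.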

The processing unit 11 then stores the type of statistical model selected by the operator in the statistical model type 111 of the mutation determination program 100, and stores the normal sample data in the normal sample data 112. In another embodiment, instead of storing the normal sample data, information representing the statistical model created in step S504 or S505 may be stored in the statistical model information 110 of the mutation determination program 100. Information indicating that the minimum threshold value is not to be used may also be stored, in association with the mutation, in the minimum value information 103 of the mutation determination program 100.

<Threshold correction>
In this embodiment, a statistical model is created from a data set of analysis results for normal samples and is used to calculate the threshold. The threshold may be corrected by excluding outliers from this data set according to a fixed criterion.

As one example of how outliers are determined, the goodness of fit to the statistical model in use is first calculated for the normal sample data used for threshold calculation. Goodness of fit is judged by a p-value: the lower the p-value, the further the data deviate from the statistical model. P-values are calculated in the same way for all mutations. The p-values for all mutations are then sorted by magnitude, and the p-value that serves as the outlier criterion is determined. The outlier criterion is, for example, p < 0.02.
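The text does not specify how the goodness-of-fit p-value is computed. As a sketch under that caveat, the example below uses the upper-tail Poisson probability of the observed mutant read count given an assumed background rate, and excludes samples with p < 0.02.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), lam > 0."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
              for i in range(k))
    return max(0.0, 1.0 - cdf)

def split_outliers(samples, rate, cutoff=0.02):
    """samples: list of (total_reads D, mutant_reads A); rate: assumed error rate."""
    kept, outliers = [], []
    for d, a in samples:
        p = poisson_upper_tail(a, rate * d)
        (outliers if p < cutoff else kept).append((d, a))
    return kept, outliers

kept, outliers = split_outliers(
    [(10000, 2), (10000, 3), (10000, 9)], rate=0.0002
)
print(kept, outliers)
```

After the outliers are dropped, the model would be refitted on `kept` and the threshold recalculated from the refitted model.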

The statistical model may also be revised by feeding back data newly input to the mutation determination method as test data.

<Limit of detection (LOD)>
The limit of detection used in the method of the present invention can be calculated from data for normal samples and positive samples. The setting of the limit of detection need not be fully automated, however; a person skilled in the art running the program may set it manually.

In Non-Patent Document 1 mentioned above, the threshold is determined by considering only the errors of normal samples. To actually set a limit of detection, however, data for positive samples are considered necessary in addition to data for normal samples. One conceivable way to set the limit of detection is to find the proportions of mutant reads for which the probability that the number of mutant reads in the sample under determination falls below the threshold is at or below a given level, and to take the smallest such proportion as the limit of detection. Unlike threshold setting, setting the limit of detection must take into account not only the possibility of misjudging a normal type as a mutant type but also the possibility of judging a mutant type as a normal type.

As the method of calculating the LOD applied to the present invention, a modified version of a known method (the instructions for the therascreen (registered trademark) KRAS RGQ PCR Kit (http://www.accessdata.fda.gov/cdrh_docs/pdf11/P110027c.pdf)) can be used. One example is to calculate it in advance by the following steps 1) to 5).
1) Dilute a sample with a known concentration of the target substance in 5 to 10 steps.
2) For each dilution step, prepare enough sample for 10 to 20 replicate experiments.
3) Detect the target substance and, for each dilution step, determine the proportion of replicates that are positive.
4) Determine a function with the concentration at each dilution step as the independent variable and the proportion of positive replicates as the dependent variable.
5) From the function obtained in 4), determine the dilution concentration at which the proportion of positive replicates is 95%.

Regarding 1) above, one example of the dilution steps is the six levels 0.025%, 0.05%, 0.1%, 0.5%, 1% and 100%, where the undiluted sample is taken as 100%. In 2) above, the number of replicate experiments (N) is, for example, N = 3, 6 or 9. In 3) above, as a concrete example, the proportion of positives is obtained from a value such as 9 positives out of 12 replicates.

Regarding 4) above, the function is obtained by, for example, logistic regression analysis. The model of the function is given by equation (1) below.

Here, X is the logarithm (base 2) of the concentration at each dilution step, p is 1 minus the threshold for the proportion of positive replicates, and β_0 and β_1 are the coefficients obtained by the logistic regression analysis. For example, for a threshold of 95%, p = 0.05.

As an example, the dilution concentration at which the proportion of positive replicates is 95% is obtained from equation (2) below.
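Equations (1) and (2) are not reproduced in this text. Purely as an illustration of step 5), the sketch below inverts a logistic curve written in the common textbook form, positive rate = 1 / (1 + exp(-(β_0 + β_1 X))) with X the base-2 logarithm of the concentration. This parameterization is an assumption and may differ in sign or form from the specification's equations, which are stated in terms of p = 1 minus the threshold. The coefficient values are invented.

```python
import math

def positive_rate(beta0, beta1, concentration):
    """Assumed logistic curve: expected proportion of positive replicates."""
    x = math.log2(concentration)
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def lod_concentration(beta0, beta1, hit_rate=0.95):
    """Concentration at which the assumed curve reaches `hit_rate`."""
    logit = math.log(hit_rate / (1.0 - hit_rate))
    x = (logit - beta0) / beta1
    return 2.0 ** x

# Invented coefficients, for illustration only.
lod = lod_concentration(beta0=6.0, beta1=2.0)
print(lod, positive_rate(6.0, 2.0, lod))
```

The coefficients themselves would come from a logistic regression fitted to the dilution series of steps 1) to 4), which is not shown here.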

In addition, a step may be performed in which the 95% confidence interval of the LOD is estimated by the bootstrap method or the like and the upper bound of that confidence interval is confirmed to be below a specified value.

The steps of the bootstrap method are 1) to 4) below.
1) From the original data set (N = n), draw the same number (n) of samples with replacement.
2) Perform logistic regression analysis on the new data set obtained by this resampling, and estimate the concentration that serves as the threshold.
3) Repeat 1) and 2) B (= 1,000 to 10,000) times.
4) Calculate the confidence interval from the distribution of the B estimates obtained in 3). For a 95% confidence interval, for example, the percentiles are calculated as (100 ± 95)/2, and the interval lies between the 2.5th and 97.5th percentiles.

If the result is 100% positive at every dilution step, the calculation cannot be carried out correctly. In that case, as an exception, the result is reported in a form such as "<0.025%".

<Threshold determination flow>
Next, the threshold is set. The total number of reads obtained by sequencing differs between samples. For example, even with the multiplex method described above, the total number of reads varies because of sample-to-sample variation in the sequencing reaction and errors in sample amount. The total number of reads also depends on the length of the sequence being sequenced. In conventional sequence data analysis, the number of reads was normalized between samples to even out this variation. For example, Kukita et al. (Non-Patent Document 1) set and apply the threshold using the actual number of reads (Depth) converted to a value per 100,000 reads. In practice, however, the threshold changes with the depth: the larger the depth (that is, the more reads), the lower the threshold. The depth changes the parameters used when fitting the distribution model, which changes the distribution of reads and therefore the threshold. As countermeasures, the method of the present invention (1) sets a minimum required number of reads and (2) does not fix the threshold but calculates it from the actual number of reads at the time of determination.

FIG. 6 shows the flow of a threshold determination process according to an embodiment of the present invention. The processing unit 11 calculates the threshold from a distribution model selected on the basis of normal sample data. When the normal sample data are sparse (mostly zeros), the processing unit 11 instead uses as the threshold the minimum threshold value calculated from a predetermined distribution model (a Poisson distribution).

First, the processing unit 11 loads the normal sample data for the target mutation, the distribution model for the target mutation, and the observed total number of reads into a work memory used for calculating the threshold (step S701).

Next, the processing unit 11 refers to the minimum value information 103 and determines whether the target mutation uses the minimum threshold value described above or another distribution model (step S702). When the minimum threshold value is used, the processing unit 11 outputs as the threshold the value (the minimum threshold value) obtained from a Poisson distribution with λ = observed total number of reads (depth) / 100,000 (step S711).
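As a rough illustration of the minimum-threshold branch (step S711), the sketch below sets λ = depth / 100,000 and evaluates the Poisson upper-tail probability P(X ≥ n) for small counts n. The function names and the example depth are hypothetical; the text specifies only λ and the use of a Poisson distribution.

```python
import math

READS_PER_UNIT = 100_000  # normalisation constant given in the text


def minimum_threshold_lambda(depth: int) -> float:
    """Poisson parameter for the minimum-threshold branch: depth / 100,000."""
    return depth / READS_PER_UNIT


def poisson_upper_tail(n: int, lam: float) -> float:
    """Return P(X >= n) for X ~ Poisson(lam), computed as 1 - P(X <= n - 1)."""
    if n <= 0:
        return 1.0
    if lam <= 0.0:
        return 0.0
    term = math.exp(-lam)  # pmf(0)
    cdf = term
    for k in range(1, n):
        term *= lam / k    # pmf(k) = pmf(k - 1) * lam / k
        cdf += term
    return max(0.0, 1.0 - cdf)


if __name__ == "__main__":
    lam = minimum_threshold_lambda(250_000)  # hypothetical depth
    for n in range(1, 6):
        print(n, poisson_upper_tail(n, lam))
```

Because λ grows in proportion to the depth, the count at which the tail probability falls below a given cutoff also depends on the depth, which is why the minimum threshold is recomputed for each observed read count.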

Thus, to increase the reliability of the data, in the present embodiment (i) when the normal sample data are sparse (mostly zeros) at the time the distribution model is determined, the minimum value information 103 is set to indicate that the minimum threshold value is to be used, and (ii) when the threshold is determined, a value such as those shown in Table 1 below, obtained from a Poisson distribution with λ = observed total number of reads (depth) / 100,000, is calculated as the minimum threshold value and used as the threshold.

When a distribution model is used, on the other hand, the processing unit 11 refers to the statistical model type 111 and selects either the Poisson distribution or the negative binomial distribution (step S703). Using the selected model, the processing unit 11 estimates the predicted number of errors for the observed total number of reads and outputs the threshold (steps S704 to S713).

Specifically, when the Poisson distribution model is selected, the processing unit 11 fits the total read counts and mutant read counts of the normal sample data to a Poisson generalized linear model (GLM) (step S704). The processing unit 11 then uses the fitted model to estimate the predicted number of errors for the observed total number of reads and sets it as λ (step S705). Based on this value, the processing unit 11 performs the Poisson threshold calculation process (step S706).
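The fitting and prediction in steps S704 and S705 can be sketched in a deliberately simplified form. The sketch assumes an intercept-only Poisson GLM with a log link and log(total reads) as an offset, for which the maximum-likelihood error rate has the closed form Σ mutant reads / Σ total reads. The text does not state the model formula, so this choice, the function names, and the sample numbers are all assumptions.

```python
def fit_error_rate(total_reads, mutant_reads):
    """MLE of the per-read error rate for an intercept-only Poisson GLM
    with log link and offset log(total_reads)."""
    if len(total_reads) != len(mutant_reads):
        raise ValueError("inputs must have the same length")
    depth_sum = sum(total_reads)
    if depth_sum <= 0:
        raise ValueError("normal sample data contain no reads")
    return sum(mutant_reads) / depth_sum


def predicted_errors(rate, observed_depth):
    """Predicted number of error reads (lambda) at the observed depth."""
    return rate * observed_depth


# Hypothetical normal-sample data: total and mutant read counts per sample.
totals = [100_000, 200_000, 100_000]
mutants = [2, 5, 1]

rate = fit_error_rate(totals, mutants)  # 8 errors over 400,000 reads
lam = predicted_errors(rate, 150_000)   # approximately 3 predicted errors
```

A model with covariates would need an iterative fit from a statistics library; the closed form holds only for this intercept-only case.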

When the negative binomial distribution is selected, the processing unit 11 fits the total read counts and mutant read counts of the normal sample data to a negative binomial generalized linear model (GLM) (step S707). The processing unit 11 estimates the size parameter from the fitted model and sets it as θ (step S708). The processing unit 11 then uses the fitted model to estimate the predicted number of errors for the observed total number of reads and sets it as μ (step S709). Based on these values, the processing unit 11 performs the negative binomial threshold calculation process (step S710) and then outputs the threshold (step S713).

The threshold setting flows for the Poisson distribution and the negative binomial distribution are described in more detail below.

<Setting the threshold using the Poisson distribution>
FIG. 7 shows the flow of the Poisson threshold determination process according to an embodiment of the present invention.

The observed total number of reads (D) is input (step S801). The predicted number of errors is λ, and its value with the fractional part discarded is int(λ) (= N) (steps S802 and S803). When N is smaller than the observed total number of reads D (N < D) (step S804), the probability (p) that the number of errors is N or more under a Poisson distribution with parameter λ is calculated (step S805). When p is 2^−5 or less (step S806), the threshold (T) is calculated from N (steps S808 and S810). When p is greater than 2^−5 (step S806), N is replaced by N + 1, the probability p is recalculated, and the above steps are repeated (step S807). When N is equal to or greater than D (step S804), the threshold (T) is calculated from D (steps S809 and S810).
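The loop in FIG. 7 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names are hypothetical, the threshold T is taken to be the count N (or D) itself because the text does not give the formula that derives T from N, and the default cutoff `alpha = 2e-5` is an assumption based on the significance level of fewer than 1 false positive per 50,000 samples stated for this embodiment. The step description prints the cutoff as 2^−5, so pass `alpha` explicitly if that value is intended.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam), evaluated in log space for stability."""
    if lam == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def poisson_threshold(lam: float, depth: int, alpha: float = 2e-5) -> int:
    """Smallest N >= int(lam) with P(X >= N) <= alpha, capped at depth."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    n = int(lam)                                      # S803: drop the fraction
    cdf = sum(poisson_pmf(k, lam) for k in range(n))  # P(X <= N - 1)
    while n < depth:                                  # S804: N < D
        p = max(0.0, 1.0 - cdf)                       # S805: P(X >= N)
        if p <= alpha:                                # S806
            return n                                  # S808: T from N (assumed T = N)
        cdf += poisson_pmf(n, lam)                    # S807: move on to N + 1
        n += 1
    return depth                                      # S809: T from D (assumed T = D)
```

With `lam = 1.0` and a depth of 100,000 reads, the sketch returns 8 for the default cutoff; when the predicted error count already reaches the depth, it returns the depth.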

<Setting the threshold using the negative binomial distribution>
FIG. 8 shows the flow of the negative binomial threshold determination process according to an embodiment of the present invention.

The observed total number of reads (D) is input (step S901). The predicted number of errors is μ, the size parameter is θ, and the value of μ with the fractional part discarded is int(μ) (= N) (steps S902 and S903). When N is smaller than the observed total number of reads D (N < D) (step S904), the probability (p) that the number of errors is N or more under a negative binomial distribution with mean μ and size parameter θ is calculated (step S905). When p is 2^−5 or less (step S906), the threshold (T) is calculated from N (steps S908 and S910). When p is greater than 2^−5 (step S906), N is replaced by N + 1, the probability p is recalculated, and the above steps are repeated (step S907). When N is equal to or greater than D, the threshold (T) is calculated from D (step S909).
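The loop in FIG. 8 can be sketched in the same spirit. The sketch assumes the mean/size parameterisation of the negative binomial distribution (mean μ, size θ, variance μ + μ²/θ), takes T to be the count N (or D) itself, and uses a default cutoff `alpha = 2e-5` derived from the 1-in-50,000 false-positive level stated for this embodiment; the function names are hypothetical, and none of these choices is confirmed by the text.

```python
import math


def nb_pmf_terms(mu: float, theta: float):
    """Yield pmf(0), pmf(1), ... for a negative binomial with mean mu, size theta."""
    # pmf(0) = (theta / (theta + mu)) ** theta; underflows to 0 for very large mu.
    term = math.exp(-theta * math.log1p(mu / theta))
    k = 0
    while True:
        yield term
        # pmf(k + 1) = pmf(k) * (k + theta) / (k + 1) * mu / (theta + mu)
        term *= (k + theta) / (k + 1) * mu / (theta + mu)
        k += 1


def negative_binomial_threshold(mu: float, theta: float, depth: int,
                                alpha: float = 2e-5) -> int:
    """Smallest N >= int(mu) with P(X >= N) <= alpha, capped at depth."""
    if mu < 0 or theta <= 0:
        raise ValueError("mu must be non-negative and theta positive")
    terms = nb_pmf_terms(mu, theta)
    n = int(mu)                                # S903: drop the fraction
    cdf = sum(next(terms) for _ in range(n))   # P(X <= N - 1)
    while n < depth:                           # S904: N < D
        if max(0.0, 1.0 - cdf) <= alpha:       # S905, S906: P(X >= N) <= alpha
            return n                           # S908: T from N (assumed T = N)
        cdf += next(terms)                     # S907: move on to N + 1
        n += 1
    return depth                               # S909: T from D (assumed T = D)
```

With μ = 1 and θ = 1 the sketch returns 16, against 8 when θ is very large (the Poisson limit), showing how overdispersion in the normal sample data raises the threshold.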

In this way, the threshold is set from the fitted model so as to achieve a predetermined significance level (in the present embodiment, fewer than 1 false positive per 50,000 samples, that is, a false-positive probability below 2 × 10⁻⁵ per sample).

In other words, the threshold is set separately for each mutation, for each read count, and for each sequenced sample.

<Alert at mutation determination>
In one embodiment, when reads have been excluded from the corresponding reads, for example in steps S304, S403, and S404, the processing unit 11 may include an alert in the output determination result (warning step). The alert is an indication that something is abnormal; the processing unit 11 outputs it when a mutation absent from known databases is detected, or when many reads carry the same base substitution, for example because of a sequencing error. In one aspect, when the target gene sequence is a cancer-related gene, the alert may be displayed when a minor mutation that is not in a cancer gene database such as the COSMIC database is present. In another aspect, the alert may be displayed when a known mutation that was not selected as a determination target is present.

Specifically, an alert is displayed when the reads excluded in steps S304, S403, and S404 include any of the following: a read whose degree of match to several reference sequences is comparable, so that it was judged mappable to more than one reference sequence; a read for which several references have the same degree of match; or a read that was assigned to a reference (InDel template) sequence but was judged to carry a further mutation (SNV and/or InDel) relative to that reference sequence (false positive).
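The alert conditions can be sketched as a simple predicate over the excluded reads. The record layout, field names, and tolerance parameter below are hypothetical; the text names the conditions but not a data structure or a numerical criterion for a "comparable" degree of match.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ExcludedRead:
    read_id: str
    # Degree of match against each candidate reference sequence (hypothetical scale).
    match_scores: List[float]
    # True when the read maps to an InDel template but carries a further SNV/InDel.
    extra_mutation_on_template: bool = False


def is_ambiguous(read: ExcludedRead, tolerance: float = 0.0) -> bool:
    """True when the two best reference matches differ by at most `tolerance`."""
    if len(read.match_scores) < 2:
        return False
    best, second = sorted(read.match_scores, reverse=True)[:2]
    return best - second <= tolerance


def needs_alert(excluded: List[ExcludedRead], tolerance: float = 0.0) -> bool:
    """True when any excluded read meets one of the alert conditions."""
    return any(
        is_ambiguous(r, tolerance) or r.extra_mutation_on_template
        for r in excluded
    )
```

A tolerance of 0 covers reads whose best references tie exactly; a positive tolerance extends the check to reads whose best matches are merely comparable.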

<Computer and program>
The processing unit 11 of the computer 10 includes a CPU (Central Processing Unit) that executes the instructions of programs (including the mutation determination program 100), which are software implementing each function, a RAM (Random Access Memory) into which the programs are loaded, and the like. The storage unit 12 of the computer 10 includes a ROM (Read Only Memory) or a storage device (referred to as a "recording medium") on which the programs and various data are recorded so as to be readable by the computer (or the CPU). The object of the present invention is achieved when the computer 10 (or the processing unit 11) reads the programs from the storage unit 12 and executes them. The recording medium may be a "non-transitory tangible medium" such as a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit. The programs may also be supplied to the computer via any transmission medium capable of transmitting them (a communication network, a broadcast wave, or the like). The present invention can also be realized in the form of a data signal embedded in a carrier wave, in which the programs are embodied by electronic transmission.

The form of the mutation detection program according to the present embodiment is not particularly limited; it may be provided, for example, as a plug-in, a standalone application, middleware, or a library.

Examples of recording media that store the mutation detection program according to the present embodiment include read-only memory (ROM); magnetic media such as floppy (registered trademark) disks and hard disks; and optical media such as optical disks, magneto-optical disks, CD-ROM, CD-R, DVD-ROM, and Blu-ray discs.

The computer 10 may also include one or more graphics boards for processing graphic information and outputting it to a display means. These components can be suitably interconnected by a bus within the computer 10. The computer 10 further includes a communication unit 13 for communicating with the external server 20, as well as suitable interfaces for communicating with general-purpose external components such as a monitor, a keyboard, a mouse, a printer, and a network.

<Gene sequence subject to mutation determination>
The gene sequence to which the mutation determination method of the present invention is applied is not particularly limited; one example is the sequence of a specific disease-related gene. Disease-related gene sequences include, but are not limited to, cancer-related gene sequences. Cancer-related genes include, but are not limited to, the epidermal growth factor receptor (EGFR) gene. EGFR encodes a transmembrane glycoprotein and is known to be involved in the growth and maintenance of cancer in vivo. In particular, EGFR expression has been confirmed in non-small cell lung cancer and other cancers. By targeting such genes, the present invention can be applied to genetic diagnosis and monitoring.

<Polynucleotide sequence sample>
The polynucleotide sequence sample used to determine a mutation may be derived from sequences present in cells, or may be a cell-free polynucleotide sequence sample. In one embodiment, the polynucleotide sequence sample is extracted and prepared from a biological sample. Polynucleotide sequence samples of the present invention also include, for example, circulating tumor DNA (ctDNA).

In one embodiment of the present invention, the nucleic acid sample to which the determination method is applied is prepared from a biological sample such as tissue, body fluid, or cells, preferably from body fluid. The biological sample is collected from a subject. To obtain a polynucleotide sequence sample from a biological sample, nucleic acid may be purified or separated from the biological sample by a conventionally known method. Biological samples include, but are not limited to, cell samples, tissue samples, and body fluid samples; body fluid samples are preferred. Body fluid samples include blood, lymph, and cerebrospinal fluid samples; blood samples are preferred, and peripheral blood samples are particularly preferred. Peripheral blood can be collected easily, for example by a fingertip puncture, which places little burden on the subject. A biological sample can likewise be collected by biopsy, swab, smear, or the like. When a blood sample is used, serum or plasma separated from the collected blood is preferably used to prepare the polynucleotide sequence sample.

The biological sample may, as necessary, have been stored by a method suited to its type, such as cryopreservation.

The polynucleotide sequence sample may be obtained from a known source of nucleic acid sequence information. It may be prepared from biological samples collected from several different individuals, from normal individuals, from a particular individual at different stages of a disease, or from individuals with a pathological predisposition.

The polynucleotide sequence sample may also be a PCR product amplified by PCR with any suitable primer set. A primer set for PCR consists of a forward primer and a reverse primer capable of amplifying the target DNA molecule or gene. The primers used to prepare samples for the present invention are not particularly limited as long as they can specifically detect the target gene; for example, they are oligonucleotides of 10 bp to 50 bp, preferably 18 bp to 24 bp. The oligonucleotide base sequence is determined from the sequence information of the target gene. The primer set may be designed at positions that allow amplification of an exon region containing the mutation in the target genome sequence.

A plasmid amplified and cloned in an Escherichia coli system may also be used as the sample. In addition to the gene sequence to be sequenced, the polynucleotide sequence sample according to the present invention may contain a certain amount of contamination.

<Application to diagnosis>
The computer program and mutation determination method of the present invention described above can be used to test for, predict, diagnose, or monitor a disease, disorder, or symptom. The results they produce can serve as one input to a physician's diagnosis of a disease, disorder, or symptom, or to the prediction of a subject's response to treatment.

Diseases, disorders, or symptoms that can be tested for, predicted, or diagnosed with the present invention include, but are not limited to, cancer. The invention can also be applied to investigating diseases, disorders, or symptoms in animals other than humans, for example mammals, invertebrates, and non-mammalian vertebrates. The method of the present invention can also be applied to, for example, prenatal genetic diagnosis.

Furthermore, the computer program and mutation determination method of the present invention can be used in a variety of nucleic acid detection applications in any field that requires detection of a specific polynucleotide sequence, such as infectious disease detection, viral load monitoring, viral genotyping, environmental testing, food testing, epidemiology, and forensic medicine.

The present invention is not limited to the embodiments described above. Various modifications are possible within the scope of the claims, and embodiments obtained by appropriately combining technical means disclosed in different embodiments are also included in the technical scope of the present invention.

(Reference Example 1)
With the EGFR gene as the target gene, the present inventors obtained very good results by using a large number of reference sequences, as shown in FIGS. 9 and 10. In FIGS. 9 and 10, the "106-type version" corresponds to reference sequences covering almost all mutations obtained from the COSMIC database, and the "8-type version" corresponds to the reference sequences used in Patent Document 1. The number of reference sequences was increased from the "8-type version", through a "23-type version" that used references for the high-frequency mutations obtained from the COSMIC database, to the final "106-type version". The "106-type version" prevents the mismatches and misjudgments that could occur with fewer references and gives favorable results.

(Reference Example 2)
With the EGFR gene as the target gene, goodness of fit to the Poisson distribution and to the negative binomial distribution was tested on the normal sample data for several mutations. Some mutations fit the Poisson distribution well, and others fit the negative binomial distribution well.

INDUSTRIAL APPLICABILITY
The present invention can be used for testing, predicting, or diagnosing various gene-related diseases including cancer, and for prenatal genetic diagnosis. It can also be used in a variety of nucleic acid detection applications in any field that requires detection of a specific polynucleotide sequence.

Claims

A mutation determination method performed by a computer to determine the presence or absence of a mutation in a target gene, the method comprising:
an extraction step of extracting, from a plurality of reads obtained by sequencing a polynucleotide derived from a specimen, corresponding reads that correspond to a specific region of the target gene and, among the corresponding reads, mutant reads having a specific mutation in the specific region;
a calculation step of calculating, using the number of extracted corresponding reads, a threshold for determining the presence or absence of the mutation at a predetermined significance level; and
a determination step of determining that the specific mutation is present when the number of extracted mutant reads exceeds the calculated threshold.

The mutation determination method according to claim 1, wherein statistical model information indicating a statistical model associated with the specific mutation is stored in the computer, and
in the calculation step, the threshold for determining the presence or absence of a significant difference at the predetermined significance level is calculated from the number of corresponding reads with reference to the statistical model information.

The mutation determination method according to claim 2, wherein the statistical model information includes the type of the statistical model associated with the specific mutation.

The mutation determination method according to claim 3, wherein the statistical model information further includes normal sample data relating to the specific mutation.

The mutation determination method according to claim 1, wherein minimum value information indicating whether to use a minimum value of the threshold is stored in the computer in association with the specific mutation, and
in the calculation step, when the minimum value information indicates that the minimum value of the threshold is to be used, the minimum value of the threshold, calculated from the number of corresponding reads with reference to a predetermined statistical model, is calculated as the threshold.

The computer stores a plurality of reference sequences,
The method according to any one of claims 1 to 5, wherein the extracting step includes mapping the plurality of reads to each of the plurality of reference sequences.

The mutation determination method according to claim 6, wherein the specific mutation is a single nucleotide substitution,
the reference sequences include a specific reference sequence that is the sequence of the specific region, and
in the extraction step, reads mapped to the specific reference sequence are extracted as the corresponding reads, each extracted corresponding read is examined for the specific mutation, and corresponding reads having the specific mutation are extracted as the mutant reads.

The mutation determination method according to claim 6, wherein the specific mutation is at least one of an insertion and a deletion,
the reference sequences include a specific reference sequence that is the sequence of the specific region and a detection reference sequence in which at least one of an insertion and a deletion has occurred in the specific reference sequence, and
in the extraction step, reads mapped to the specific reference sequence and reads mapped to the detection reference sequence are extracted as the corresponding reads, and the reads mapped to the detection reference sequence are extracted as the mutant reads.

The mutation determination method according to claim 8, wherein the specific mutation consists of at least one of an insertion of predetermined bases and a deletion of predetermined bases,
the detection reference sequence comprises a plurality of detection reference sequences corresponding to mutually different mutations, and
in the extraction step, reads mapped to the detection reference sequence that corresponds to the specific mutation are extracted as the mutant reads.

The mutation determination method according to claim 8, wherein in the extraction step, when, among the mutant reads mapped to a single detection reference sequence, the proportion of mismatched reads having a sequence mismatch with that detection reference sequence exceeds a predetermined proportion, the mismatched reads are excluded from the corresponding reads and the mutant reads.

The method according to any one of claims 6 to 10, wherein in the extraction step, the reads mapped to a plurality of the reference sequences are excluded from the corresponding reads and the mutation reads.

The mutation determination method according to claim 10, further comprising a warning step of outputting a warning when at least one read is excluded from the corresponding reads in the extraction step.

The target gene is an EGFR gene,
The method according to any one of claims 6 to 12, wherein the plurality of reference sequences include each of the base sequences shown in SEQ ID NOs: 1 to 101.

The mutation determination method according to any one of claims 6 to 13, further comprising, before the extraction step, an updating step of updating the reference sequences based on information obtained from an external database.

The mutation determination method according to any one of claims 1 to 14, wherein, in the extraction step, reads shorter than a predetermined length are removed before the corresponding reads are extracted.

The mutation determination method according to any one of claims 1 to 15, wherein the polynucleotide is a PCR product, and
the sequencing obtains the plurality of reads by reading the same region redundantly.

A mutation determination program executed by a computer to determine the presence or absence of a mutation in a target gene, the program comprising:
an extraction step of extracting, from a plurality of reads obtained by sequencing a polynucleotide derived from a specimen, corresponding reads that correspond to a specific region of the target gene and, among the corresponding reads, mutant reads having a specific mutation in the specific region;
a calculation step of calculating, using the number of extracted corresponding reads, a threshold for determining the presence or absence of the mutation at a predetermined significance level; and
a determination step of determining that the specific mutation is present when the number of extracted mutant reads exceeds the calculated threshold.

A computer-readable recording medium on which the mutation determination program according to claim 17 is recorded.