JP2017016665A - Method for selecting variation information from sequence data, system, and computer program

Info

Publication number: JP2017016665A
Application number: JP2016132843A
Authority: JP
Inventors: Masao Nagasaki; Kaname Kojima; Yosuke Kawai
Original assignee: Tohoku University NUC
Current assignee: Tohoku University NUC
Priority date: 2015-07-03
Filing date: 2016-07-04
Publication date: 2017-01-19

Abstract

PROBLEM TO BE SOLVED: To acquire highly reliable variation information from read information derived from a DNA sample. SOLUTION: The concordance between genome variation information based on read information acquired from a next-generation sequencer and genome variation information acquired by another means, such as an SNP array, was found to be closely associated with the depth of the read information (read depth). A read depth threshold is determined using this concordance as an index, and highly reliable genome variation information can be acquired by filtering the genome variation information on the basis of the read information. SELECTED DRAWING: Figure 1

Description

The present invention relates to a computer-implemented method for selecting mutation information of an organism, a system for performing the method, and a computer program.

Before the introduction of so-called next-generation sequencers (NGS), microarrays were the main tool for comprehensive genome analysis. With microarrays, however, it is difficult to detect unknown genomic mutation information completely, because the probes must be designed in advance.

Establishing a means of detecting genomic mutation information using NGS is therefore awaited in the industry.

However, the read information provided by NGS carries risks such as mapping errors.

The problem was therefore to eliminate such risks from the read information and find highly reliable genomic mutation information.

The present inventor found that the degree of concordance between genomic mutation information based on read information obtained from NGS or the like and genomic mutation information obtained by other means, such as an SNP array, is closely related to the depth of the read information (read depth). The inventor then found that highly reliable genomic mutation information can be obtained by determining a read depth threshold using this concordance as an index and filtering the genomic mutation information based on the read information, and thereby completed the present invention. The inventor further found that the reliability of genomic mutation information derived from the read information can be increased by using other filtering means in addition to this basic filtering means.

The present invention provides a computer-implemented method for selecting gene mutation information, characterized by performing the following steps (1) to (3) (hereinafter also referred to as the selection method of the present invention).
(1) A step of performing the following for each DNA sample:
(1-1) grouping primary mutation set information, which is based on a mapping algorithm and a mutation call algorithm, relates to each DNA sample of one or more individuals of a target organism, and has read depths assigned to it, into subsets each having a predetermined range of read depth;
(1-2) calculating a measured value of genotype concordance for the mutation information of each subset by comparing the genotypes of the mutation information between the primary mutation set information and secondary mutation set information obtained from each DNA sample by means other than mapping;
(1-3) determining an appropriate range of read depth by combining into one the read depth ranges of the subsets whose measured value of genotype concordance is higher than a predetermined threshold.
(2) A step of removing, for each DNA sample, mutation information having a read depth outside the appropriate range of read depth from the primary mutation set information of step (1-1), to obtain the remaining mutation information for each DNA sample.
(3) A step of extracting the mutation information that remains after steps (1) and (2) have been performed for at least one DNA sample among all the DNA samples, together with the locus to which the mutation information belongs, and selecting it as the target mutation information.
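Steps (2) and (3) can be illustrated with a minimal sketch. It assumes that the appropriate read depth range of each sample has already been determined in step (1); the `Call` record, the function names, and the (chromosome, position) locus representation are hypothetical and are not taken from the specification.

```python
from typing import Dict, List, NamedTuple, Set, Tuple

Locus = Tuple[str, int]  # (chromosome, position); a hypothetical representation


class Call(NamedTuple):
    locus: Locus
    genotype: str  # e.g. "het" or "nonref_hom"
    depth: int     # read depth assigned to this call


def filter_by_depth(calls: List[Call], lo: int, hi: int) -> List[Call]:
    """Step (2): drop calls whose read depth lies outside [lo, hi]."""
    return [c for c in calls if lo <= c.depth <= hi]


def select_loci(samples: Dict[str, List[Call]],
                ranges: Dict[str, Tuple[int, int]]) -> Set[Locus]:
    """Step (3): loci with a surviving call in at least one sample."""
    selected: Set[Locus] = set()
    for name, calls in samples.items():
        lo, hi = ranges[name]
        selected.update(c.locus for c in filter_by_depth(calls, lo, hi))
    return selected
```

In this reading, a locus is kept if its call survives the depth filter in any one sample; whether a stricter rule is intended is not settled by the text alone.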

In the present invention, a "DNA sample" is a sample containing the DNA of a test subject (target organism); this DNA includes genomic DNA, cDNA, and the like. The organism that is the source of the DNA sample is not particularly limited as long as it has a genome, and includes mammals such as humans, reptiles, insects, plants, bacteria, and viruses. Of these, humans are preferred.

The term "mutation" means a variation in the nucleotide sequence of a genome, and specifically includes single nucleotide variants (SNVs), insertions, and deletions. "Mutation information" is digital information representing these.

The term "read information" refers to digital information about nucleotide sequences obtained by processing a genome with a sequencer. Read information is divided into two types: single-read information, which consists of a single piece of sequence information, and paired-end read information, which includes sequence information for two adjacent regions on the genome. In the present invention, read information includes both types. The sequencers that provide read information are "high-performance sequencers", including so-called "next-generation sequencers", which can provide a large amount of information in a relatively short time. Examples of next-generation sequencers at the time the present invention was made include Genome Sequencer FLX (Roche (454)); Genome Analyzer IIx, HiSeq2000, HiSeq2500, and MiSeq (Illumina); SOLiD (Applied Biosystems); and PacBio RS II (Pacific Biosciences). The sequencers are not limited to these, and include all sequencers available now or in the future.

The "mapping algorithm" used in generating the primary mutation set information is an algorithm for identifying the position in the reference sequence from which a read sequence was read; examples include BWA-MEM and Bowtie2. In this specification, the claims, and the drawings, "mapping" is synonymous with "alignment". The "mutation call algorithm" is an algorithm for extracting, from the mapped read information, the read information in which mutation information is found; examples include Bcftools and the Genome Analysis Toolkit (GATK). Applying the mutation call algorithm to the mapped read information extracts the mutation information and classifies it as "reference homo", "hetero", or "non-reference homo". "Reference homo" is the form in which the same base as the standard genome sequence (or reference genome sequence) is present at a given site as a homozygous genotype, and "non-reference homo" is the form in which a base different from the standard genome sequence is present as a homozygous genotype. "Hetero" is the form in which a base identical to the standard genome sequence and a base different from it are present as a heterozygous genotype.

The term "read depth" is synonymous with coverage. It means the number of reads corresponding to each mutation site (locus), calculated when the DNA sample is mapped, and is an index of how many times the mutation site under test was read redundantly. In this specification it is sometimes written simply as "depth".
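As a minimal illustration of this definition, read depth at a position can be counted as the number of mapped reads that cover it. The half-open (start, end) interval representation and the function name are assumptions made for this sketch; real aligners and callers apply additional rules (for example, mapping quality filters) that are not modeled here.

```python
from typing import Iterable, Tuple


def read_depth(reads: Iterable[Tuple[int, int]], position: int) -> int:
    """Count reads whose half-open interval [start, end) covers the position."""
    return sum(1 for start, end in reads if start <= position < end)


# Three hypothetical reads mapped to the same reference sequence.
reads = [(0, 100), (50, 150), (120, 220)]
depth_at_75 = read_depth(reads, 75)    # covered by the first two reads
depth_at_300 = read_depth(reads, 300)  # covered by no read
```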

The "means other than mapping" for obtaining the "secondary mutation set information" is not particularly limited and includes, for example, Sanger sequencing, MassARRAY (registered trademark), and SNP arrays; it has an accuracy at least equal to, and preferably higher than, that of the sequencer used to obtain the primary mutation set information. Among these means, SNP arrays are preferred because they are inexpensive and can determine a large number of mutation sites at once with high accuracy. Here, "a large number at once with high accuracy" preferably corresponds to about 300,000 to about 3,000,000 SNPs.

In the "secondary mutation set information", the above classification into "reference homo", "hetero", and "non-reference homo" is assigned by a technique appropriate to the specific "means other than mapping". In the case of an SNP array, for example, the classification is made by applying a clustering algorithm called a "genotype call" to the color of the probe dye or the fluorescence of the fluorescent dye. In the case of Sanger sequencing, it is known that the classification is made by analyzing the waveform of a chart based on the color of the probe dye or the fluorescence of the fluorescent dye.

The "measured value of genotype concordance" in step (2-2) above is preferably (1) precision and/or (2) detection rate (power). That is, either precision or detection rate, or both, can be used as the measured value. Specifically, a predetermined threshold is set for these index values, and the case in which the measured value is equal to or higher than the threshold for precision or for detection rate, or equal to or higher than the thresholds for both, can be recognized as "genotype concordance higher than the predetermined threshold". When only one of the two is used, it is preferable to give priority to precision as the threshold index; it is more preferable to use both precision and detection rate as threshold indices. These index values are explained below.

<Index values (precision and detection rate)>
These index values can be calculated from the "mutation call" performed in step (1). For the "non-matching information", in which read information does not match the reference sequence information, that is observed in the preceding mapping of the read information or use of the SNP array, executing the mutation call algorithm determines whether the mismatch is due to an error or due to a mutation that the DNA sample truly has. When the mutation call algorithm determines that the non-matching information is not a "mismatch due to an error", the site where the mismatch occurred is judged to contain a mutation. Furthermore, in a diploid organism such as a human, when all the read information at a site where a mutation has been identified is of the non-reference type, the site is "non-reference homo", and the mutation is regarded as present in the DNA derived from both parents. When the read information at such a site is a mixture of reference and non-reference types, the site is "hetero", and the mutation is regarded as present in the DNA derived from one of the parents. Even where the mutation call algorithm does not call a mutation, a site at which all the read information is of the reference type can be regarded as being in the reference homo state.
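The three-way classification described above can be written as a deliberately simplified sketch. It applies only the all-or-mixed rule stated in the text to per-site read counts; the function name and the string labels are hypothetical, and practical mutation call algorithms use statistical models that tolerate sequencing errors rather than this literal rule.

```python
def classify_genotype(ref_reads: int, alt_reads: int) -> str:
    """Classify a diploid site from counts of reference and non-reference reads."""
    if ref_reads < 0 or alt_reads < 0:
        raise ValueError("read counts must be non-negative")
    if ref_reads + alt_reads == 0:
        raise ValueError("no reads cover this site")
    if alt_reads == 0:
        return "ref_hom"      # all reads match the reference
    if ref_reads == 0:
        return "nonref_hom"   # all reads differ from the reference
    return "het"              # reference and non-reference reads are mixed
```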

Summarizing the above: when the target organism is human, for example, the non-matching information after the mutation call algorithm has been executed can be classified in three ways, (i) reference homo, (ii) hetero, and (iii) non-reference homo, and combining the case of the primary mutation set information with that of the secondary mutation set information gives nine cases in total.

Table 1 below shows these nine cases, and the precision and detection rate calculated from them are explained. The result of the NGS-based mutation call yields the primary mutation set information, and the result of the SNP-array-based mutation call yields the secondary mutation set information. The letters in the table (a to i) are the numbers of genotypes detected as mutation information in each subset. For example, "d" in Table 1 is the number of genotypes determined to be "hetero" by NGS and "reference homo" by the SNP array.

Here, precision is "the proportion of genotypes determined to be hetero or non-reference homo in the primary mutation set information (NGS in this example) for which the same determination was made in the secondary mutation set information (the SNP array in this example)"; in Table 1 above it is the value expressed as (e + i) / (d + e + f + h + i). The detection rate is "the proportion of genotypes determined to be hetero or non-reference homo in the secondary mutation set information (the SNP array in this example) for which the same determination was made in the primary mutation set information (NGS in this example)"; in Table 1 above it is the value expressed as (e + i) / (b + c + e + f + h + i). The display format of precision and detection rate is not limited; for example, either may be shown directly as a ratio or as a percentage.
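The two formulas can be checked with a short sketch. Table 1 itself is not reproduced in this text, so the cell layout used here is an assumption inferred from the description of "d": rows are the NGS genotype and columns are the SNP array genotype, each ordered reference homo, hetero, non-reference homo, with cells lettered a to i row by row. The denominators follow the formulas exactly as written above (the precision denominator as written does not include g). All names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def precision_and_power(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[Optional[float], Optional[float]]:
    """pairs: (ngs_genotype, array_genotype) for the sites in one subset."""
    counts = Counter(pairs)

    def n(ngs: str, array: str) -> int:
        return counts.get((ngs, array), 0)

    # Cell letters under the assumed row-major layout (rows = NGS).
    b, c = n("ref_hom", "het"), n("ref_hom", "nonref_hom")
    d, e, f = n("het", "ref_hom"), n("het", "het"), n("het", "nonref_hom")
    h, i = n("nonref_hom", "het"), n("nonref_hom", "nonref_hom")

    precision_den = d + e + f + h + i      # as written in the text
    power_den = b + c + e + f + h + i      # as written in the text
    precision = (e + i) / precision_den if precision_den else None
    power = (e + i) / power_den if power_den else None
    return precision, power
```

Returning `None` for an empty denominator is a choice made for this sketch so that subsets with no variant calls are not mistaken for perfectly concordant ones.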

The threshold for precision in step (2-3) above is preferably 0.995 or more, and more preferably 0.998 or more. Likewise, the threshold for the detection rate is preferably 0.93 or more, and more preferably 0.95 or more.
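The threshold test can be sketched as follows, using the preferred minimum values stated above as defaults. The function name and the `require` parameter are hypothetical; the three modes correspond to using precision alone, detection rate alone, or both as the threshold index.

```python
def meets_concordance_threshold(
    precision: float,
    power: float,
    min_precision: float = 0.995,
    min_power: float = 0.93,
    require: str = "both",
) -> bool:
    """Return True when a subset's concordance is at or above the threshold(s)."""
    ok_precision = precision >= min_precision
    ok_power = power >= min_power
    if require == "precision":
        return ok_precision
    if require == "power":
        return ok_power
    if require == "both":
        return ok_precision and ok_power
    raise ValueError("require must be 'precision', 'power', or 'both'")
```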

Furthermore, the selection method of the present invention may take a form in which steps (1) to (4) above are performed to obtain first specific mutation information; one or more pieces of mutation information defined as mutation information to be removed in one or more steps selected from (A) to (D) below are removed from the first specific mutation information to obtain second specific mutation information; and the second specific mutation information is selected as the target mutation information.
(A) When the first specific mutation information is obtained by using two or more DNA samples, and the proportion of a specific piece of mutation information having a read depth outside the appropriate range of read depth exceeds 5% to 20% among the mutation sites in step (3), that specific mutation information is defined as the mutation information to be removed;
(B) all mutation information present in regions designated as low-complexity regions is selected from the primary mutation set information, a measured value of genotype concordance is calculated for each low-complexity region between all of that mutation information and the mutation information of the secondary mutation set information, and all mutation information of a low-complexity region whose measured value of genotype concordance is at or below a predetermined threshold is defined as the mutation information to be removed;
(C) step (1), or steps (1) to (4), is performed using a mapping algorithm and a mutation call algorithm different from those of the initial step (1) to obtain different primary mutation set information or different mutation information for each DNA sample, and mutation information that is included in the initial mutation information but not in the different primary mutation set information or the different mutation information is defined as the mutation information to be removed;
(D) an algorithm for the Hardy-Weinberg equilibrium test is performed, and mutation information that deviates from Hardy-Weinberg equilibrium is defined as the mutation information to be removed.
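For step (D), one common form of the Hardy-Weinberg equilibrium test is a chi-square goodness-of-fit test with one degree of freedom on the genotype counts at a biallelic site. The text does not specify which test or which cutoff is used, so the sketch below is only one possible instance, and the function name is hypothetical.

```python
import math


def hwe_chi_square_p(n_ref_hom: int, n_het: int, n_nonref_hom: int) -> float:
    """p-value of a 1-degree-of-freedom chi-square test for HWE at one site."""
    n = n_ref_hom + n_het + n_nonref_hom
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: no deviation can be tested
    observed = (n_ref_hom, n_het, n_nonref_hom)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))
```

A site would then be treated as deviating when the p-value falls below a chosen cutoff; for small sample sizes an exact test is often preferred over the chi-square approximation.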

These additional steps (A) to (D) presuppose that the selection method of the present invention described above is performed, and define the case in which one of the four other types of filtering means, shown as steps (A), (B), (C), and (D), is additionally performed, or in which two to four of them are performed in combination. Adding these additional steps (A) to (D) to the selection method of the present invention makes it possible to obtain more reliable mutation information derived from read information. The "specific mutation information" in additional step (A) means, for example, the mutation information of an individual SNV or the like.
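Additional step (A) can be sketched under one possible reading: for each variant site, the fraction of samples in which the site's read depth falls outside that sample's appropriate range is computed, and the site is marked for removal when the fraction exceeds a cutoff chosen between 5% and 20%. This interpretation, the data layout, and all names are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

Locus = Tuple[str, int]  # (chromosome, position); hypothetical representation


def loci_exceeding_out_of_range_rate(
    depths_by_locus: Dict[Locus, Dict[str, int]],
    ranges: Dict[str, Tuple[int, int]],
    max_fraction: float = 0.10,
) -> Set[Locus]:
    """Return loci whose out-of-range fraction across samples exceeds max_fraction."""
    to_remove: Set[Locus] = set()
    for locus, per_sample in depths_by_locus.items():
        if not per_sample:
            continue
        outside = sum(
            1
            for sample, depth in per_sample.items()
            if not (ranges[sample][0] <= depth <= ranges[sample][1])
        )
        if outside / len(per_sample) > max_fraction:
            to_remove.add(locus)
    return to_remove
```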

The present invention further provides a computer system for performing the selection method of the present invention described above, a computer program containing an algorithm for causing a computer to execute the selection method of the present invention, and digital media on which the computer program of the present invention is stored in a computer-readable state.

The present invention also provides mutation information obtained by the selection method of the present invention.

The present invention further provides a computer-implemented method for determining an appropriate range of read depth, comprising the following steps (1) to (3) (hereinafter also referred to as the depth determination method of the present invention).
(1) Grouping primary mutation set information, which is based on a mapping algorithm and a mutation call algorithm, relates to each DNA sample of one or more individuals of a target organism, and has read depths assigned to it, into subsets each having a predetermined range of read depth;
(2) calculating a measured value of genotype concordance for the mutation information of each subset by comparing the genotypes of the mutation information between the primary mutation set information and secondary mutation set information obtained from each DNA sample by means other than mapping;
(3) determining the appropriate range of read depth by combining into one the read depth ranges of the subsets whose measured value of genotype concordance is higher than a predetermined threshold.
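The grouping in step (1) and the merging in step (3) can be sketched as follows. Fixed-width bins, inclusive upper bounds, and the function names are assumptions; the per-bin concordance values are taken as already computed in step (2). When the passing bins are not contiguous, this sketch simply returns the span from the lowest to the highest passing bin, which is one possible reading of "combining into one".

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

DepthRange = Tuple[int, int]  # inclusive (low, high)


def bin_by_depth(records: Iterable[Tuple[int, object]],
                 width: int) -> Dict[DepthRange, List[object]]:
    """Step (1): group (depth, payload) records into fixed-width depth bins."""
    bins: Dict[DepthRange, List[object]] = defaultdict(list)
    for depth, payload in records:
        low = (depth // width) * width
        bins[(low, low + width - 1)].append(payload)
    return dict(bins)


def appropriate_range(bin_scores: Dict[DepthRange, Optional[float]],
                      threshold: float) -> Optional[DepthRange]:
    """Step (3): merge the ranges of bins whose score reaches the threshold."""
    passing = [rng for rng, score in bin_scores.items()
               if score is not None and score >= threshold]
    if not passing:
        return None
    return (min(low for low, _ in passing), max(high for _, high in passing))
```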

The individual definitions for the depth determination method of the present invention are as described for the selection method of the present invention. For example, the "measured value of genotype concordance" is preferably the "precision and/or detection rate (power)" described above.

The present invention further provides a computer system or computer apparatus for performing the depth determination method of the present invention, a computer program containing an algorithm for causing a computer to execute the depth determination method of the present invention, and digital media on which that computer program is stored in a computer-readable state.

According to the present invention, highly reliable mutation information can be obtained from read information derived from a DNA sample.

A drawing showing an overview of the SNV filter of the present invention.

Specifically, the selection method and the depth determination method of the present invention described above are executed in the computer system of the present invention, by having a computer carry out the algorithm of the computer program of the present invention. In these forms, highly reliable mutation information can easily be obtained from read information derived from a DNA sample.

[Computer System of the Present Invention]
The computer system or computer apparatus is principally a computer system for performing each step of the selection method of the present invention on a computer. "Computer system" is substantially synonymous with "computer apparatus". The computer system or apparatus usually has a recording unit (HD), an arithmetic processing unit (CPU), an operation unit (keyboard), and a display unit (display). The recording unit stores the electronic information that forms the basis for performing the selection method of the present invention; the arithmetic processing unit performs the processing of the necessary steps on the electronic information read out to a temporary recording unit (memory), and the results are again stored in the recording unit. The arithmetic processing unit also creates image data for prompting operation of the operation unit or for displaying processing results, and this image data is displayed on the display unit via video RAM. In this way the processing that advances each step is performed, and finally the required mutation information is derived. The same applies to the computer system or apparatus for performing the depth determination method of the present invention: the processing of each step for determining the appropriate read depth is performed on the computer, whereby the desired appropriate read depth is derived.

A principal embodiment of the computer system or apparatus of the present invention is a computer system that selects specific mutation information from base sequence data, comprises at least a recording unit and an arithmetic processing unit, and is characterized in that the following processes (a) to (f) are executed.
(a) The recording unit stores primary mutation set information, which is based on a mapping algorithm and a mutation call algorithm, relates to each DNA sample of one or more individuals of a target organism, and has read depths assigned to it, and secondary mutation set information obtained from each DNA sample by means other than mapping;
(b) the arithmetic processing unit executes a process of grouping the primary mutation set information read from the recording unit into subsets each having a predetermined range of read depth;
(c) a measured value of genotype concordance for the mutation information of each subset is calculated by executing a process of comparing the genotypes of the mutation information between the primary mutation set information and the secondary mutation set information read from the recording unit;
(d) a process of determining the appropriate range of read depth is executed by combining into one the read depth ranges of the subsets whose measured value of genotype concordance is higher than a predetermined threshold;
(e) for each DNA sample, a process of excluding from the primary mutation set information the mutation information having a read depth outside the appropriate range of read depth determined in (d) is executed, and the remaining mutation information is extracted for each DNA sample;
(f) the mutation information that remains after processes (a) to (e) have been performed for at least one DNA sample among all the DNA samples is extracted together with the locus to which the mutation information belongs, and is identified as the target specific mutation information.

Processes (a) to (f) above correspond to steps (1) to (3) of the selection method of the present invention. That is, processes (a) and (b) correspond to step (1-1), process (c) to step (1-2), process (d) to step (1-3), process (e) to step (2), and process (f) to step (3).

As described above, the "primary mutation set information" in process (a) is mutation information, based on a mapping algorithm and a mutation call algorithm, that relates to each DNA sample of one or more individuals of the target organism and has read depths assigned to it. The "mapping algorithm" and the "mutation call algorithm" are as described above. The means of obtaining the "primary mutation set information" is not limited: for example, information generated from read information on the same computer terminal using the "mapping algorithm" and the "mutation call algorithm" may be used, or "primary mutation set information" generated separately elsewhere may be supplied to the computer system of the present invention. Likewise, the means of obtaining the "secondary mutation set information" obtained by means other than mapping, such as an SNP array, is not limited in any way.
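Whatever their origin, the two sets have to be brought together site by site before genotypes can be compared. The sketch below joins them on a shared locus key; the dictionary layouts and the function name are hypothetical, and sites present in only one of the two sets are simply skipped in this sketch.

```python
from typing import Dict, List, Tuple

Locus = Tuple[str, int]  # (chromosome, position); hypothetical representation


def paired_genotypes(
    primary: Dict[Locus, Tuple[str, int]],  # locus -> (NGS genotype, read depth)
    secondary: Dict[Locus, str],            # locus -> array genotype
) -> List[Tuple[int, str, str]]:
    """Return (depth, ngs_genotype, array_genotype) for loci present in both sets."""
    return [
        (depth, ngs_genotype, secondary[locus])
        for locus, (ngs_genotype, depth) in sorted(primary.items())
        if locus in secondary
    ]
```

The resulting triples carry everything needed to group sites by read depth and to count agreeing and disagreeing genotypes within each group.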

処理（ａ）は、後述する実施例の「選択手順１」に該当する処理である。 The process (a) is a process corresponding to “selection procedure 1” in an embodiment described later.

The process of grouping into subsets for each read depth range (process (b)) and the process of comparing the genotypes of the mutation information between the primary mutation set information and the secondary mutation set information (process (c)) correspond to "selection procedure 2" (the middle graph of "Procedure 2" in FIG. 1) of the Examples described later. As described above, the "measured value of the genotype concordance of the mutation information of each subset" in process (c) is preferably the precision and/or the detection rate (power), more preferably the precision, and most preferably both the precision and the detection rate. The threshold for the precision is preferably 0.995 or more, and more preferably 0.998 or more. The threshold for the detection rate is preferably 0.93 or more, and more preferably 0.95 or more.

Next, the process of determining the appropriate range of read depth (process (d)) and the process of excluding mutation information whose read depth falls outside that range (process (e)) correspond to "selection procedure 2" ("Procedure 2" in FIG. 1) of the Examples described later.

Furthermore, an example of the result of the process of identifying the target specific mutation information (process (f)) is shown under "Procedure 2" in Table 2 of the Examples described later.

In the computer system of the present invention, it is preferable that the specific mutation information obtained by the above processing is treated as first specific mutation information, that one or more pieces of mutation information defined as mutation information to be excluded by one or more processes selected from the following (α) to (δ) are removed from the first specific mutation information to obtain second specific mutation information, and that the second specific mutation information is extracted as the target mutation information.
(α) A process in which, when the first specific mutation information is obtained using two or more DNA samples and the proportion of a specific piece of mutation information having a read depth outside the appropriate range exceeds a threshold of 5% to 20% of the mutation information obtained by the above process (f), that specific mutation information is defined as the mutation information to be excluded.
(β) A process in which all mutation information located in regions designated as low-complexity regions is extracted from the primary mutation set information, a measured value of genotype concordance is calculated for each low-complexity region between that mutation information and the mutation information of the secondary mutation set information, and all mutation information in low-complexity regions whose measured genotype concordance is at or below a predetermined threshold is defined as the mutation information to be excluded.
(γ) A process in which, with respect to different primary mutation set information or different mutation information for each DNA sample generated using a mapping algorithm and a mutation call algorithm different from those used to generate the primary mutation set information used first, mutation information that is included in the mutation information first extracted by process (f) but is not included in the different primary mutation set information or the different mutation information is defined as the mutation information to be excluded.
(δ) A process in which an algorithm for the Hardy-Weinberg equilibrium test is run, and mutation information deviating from Hardy-Weinberg equilibrium is defined as the mutation information to be excluded.

Additional process (α), "removal of mutation information based on the read depth at each mutation site", corresponds to additional step (a) of the selection method of the present invention and to "selection procedure 3" of the Examples described later ("Procedure 3" in FIG. 1, particularly the middle part). An example of the result is shown under "Procedure 3" in Table 2.

Additional process (β), "removal of mutation information based on genome complexity", corresponds to additional step (b) of the selection method of the present invention and to "selection procedure 4" of the Examples described later ("Procedure 4" in FIG. 1, particularly the table on the left and the figure on the right). An example of the result is shown under "Procedure 4" in Table 2.

Additional process (γ), "removal of mutation information related to bias in the analysis method", corresponds to additional step (c) of the selection method of the present invention and to "selection procedure 5" of the Examples described later ("Procedure 5" in FIG. 1, particularly the upper figure). An example of the result is shown under "Procedure 5" in Table 2.

Additional process (δ), "removal of mutation information on population-genetic grounds", corresponds to additional step (d) of the selection method of the present invention. Specifically, a Hardy-Weinberg equilibrium test is performed and mutation information deviating from that equilibrium is removed. Additional process (δ) corresponds to "selection procedure 6" of the Examples described later ("Procedure 6" in FIG. 1), and an example of the result is shown under "Procedure 6" in Table 2.

The computer system or apparatus of the present invention can be configured to execute not only the selection method of the present invention but also the depth determination method of the present invention. Specifically, this is a computer system or apparatus for selecting a specific range of read depth from base sequence data, comprising at least a recording unit and an arithmetic processing unit, and characterized in that the following processes (a) to (d) are executed.
(a) The recording unit stores primary mutation set information to which a read depth is assigned, obtained on the basis of a mapping algorithm and a mutation call algorithm for each DNA sample of one or more individuals of the target organism, and secondary mutation set information obtained from each DNA sample by means other than mapping.
(b) The arithmetic processing unit groups the primary mutation set information read from the recording unit into subsets each having a predetermined range of read depth.
(c) A measured value of the genotype concordance of the mutation information of each subset is calculated by comparing the genotypes of the mutation information between the primary mutation set information and the secondary mutation set information read from the recording unit.
(d) The appropriate range of read depth is determined by merging the read depth ranges of the subsets whose measured genotype concordance is higher than a predetermined threshold.

The above processes (a) to (d) are substantially the same as (a) to (d) among the processes (a) to (f) in the computer system or apparatus for executing the selection method of the present invention.
[Computer program of the present invention]
A first aspect of the computer program of the present invention is a computer program containing an algorithm for executing the selection method of the present invention. Specifically, it is a computer program for selecting specific mutation information from base sequence data, characterized in that it contains an algorithm that causes a computer to realize the following functions (a) to (f).
(a) A function of reading, from a recording unit, primary mutation set information to which a read depth is assigned, obtained on the basis of a mapping algorithm and a mutation call algorithm for each DNA sample of one or more individuals of the target organism, and secondary mutation set information obtained from each DNA sample by means other than mapping.
(b) A function of grouping the primary mutation set information read from the recording unit into subsets each having a predetermined range of read depth.
(c) A function of calculating a measured value of the genotype concordance of the mutation information of each subset by comparing the genotypes of the mutation information between the primary mutation set information and the secondary mutation set information read from the recording unit.
(d) A function of determining the appropriate range of read depth by merging the read depth ranges of the subsets whose measured genotype concordance is higher than a predetermined threshold.
(e) A function of excluding, from the primary mutation set information for each DNA sample, mutation information whose read depth is outside the appropriate range determined in (d), and extracting the mutation information remaining for each DNA sample.
(f) A function of extracting, for at least one DNA sample among all the DNA samples, the mutation information remaining after the above processes (a) to (e) together with the locus to which it belongs, and identifying it as the target specific mutation information.

The above functions (a) to (f) correspond directly to "processes (a) to (f)" in the computer system or apparatus of the present invention. That is, the computer program of the present invention contains an algorithm that, by realizing functions (a) to (f) on a computer, causes the computer to perform processes (a) to (f).

The algorithm for realizing function (a) is as described for function (a) above.

An example of an algorithm for realizing function (b) is one that partitions the primary mutation set information by the read depth value assigned to it. The bin width is not particularly limited: the partition can be made in steps of 1, or in larger steps. In the Examples described later, the partition was made in steps of 1.

An example of an algorithm for realizing function (c) is one that, for each subset of mutation information partitioned by read depth from the primary mutation set information by function (b), classifies each piece of mutation information as "reference homozygous", "heterozygous", or "non-reference homozygous" as described above, classifies the secondary mutation set information in the same way, and then calculates and aggregates (averages) the "precision" and/or "detection rate" described above.

An example of an algorithm for realizing function (d) is one that compares the predetermined thresholds for "precision" and/or "detection rate" (as given above) with the values calculated and aggregated (averaged) for each read depth subset, extracts the subsets whose values exceed the thresholds, and merges the read depths of those subsets to generate the appropriate range of read depth.
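Functions (b) to (d) can be sketched together as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the record layout, the function names `bin_metrics` and `appropriate_depth_range`, and the genotype coding (0 = reference homozygous, 1 = heterozygous, 2 = non-reference homozygous) are hypothetical. The definitions of precision and detection rate used here (agreement among non-reference calls in the primary set, and recovery of non-reference genotypes in the secondary set) are also simplifying assumptions. The default thresholds are the more preferable values given in this description (0.998 and 0.95).

```python
from collections import defaultdict

# Genotype coding assumed for this sketch:
# 0 = reference homozygous, 1 = heterozygous, 2 = non-reference homozygous.


def bin_metrics(records):
    """Group (read_depth, ngs_genotype, array_genotype) records by depth
    (bin width 1) and return {depth: (precision, detection_rate)}."""
    # Per depth: [NGS non-ref calls, of which matching,
    #             array non-ref genotypes, of which matching]
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for depth, ngs, array in records:
        c = counts[depth]
        if ngs != 0:
            c[0] += 1
            c[1] += int(ngs == array)
        if array != 0:
            c[2] += 1
            c[3] += int(ngs == array)
    metrics = {}
    for depth, (called, called_ok, truth, truth_ok) in counts.items():
        if called and truth:  # skip bins where a ratio is undefined
            metrics[depth] = (called_ok / called, truth_ok / truth)
    return metrics


def appropriate_depth_range(metrics, min_precision=0.998, min_power=0.95):
    """Merge the depth bins whose metrics exceed both thresholds."""
    passing = sorted(d for d, (prec, power) in metrics.items()
                     if prec > min_precision and power > min_power)
    if not passing:
        return None
    return passing[0], passing[-1]


# Toy data: depths 10 and 11 agree perfectly, depth 2 does not.
records = [(10, 1, 1)] * 5 + [(11, 2, 2)] * 3 + [(2, 1, 0), (2, 0, 1), (2, 1, 1)]
metrics = bin_metrics(records)
print(appropriate_depth_range(metrics))
```

Returning a single (minimum, maximum) pair assumes that the passing depth bins are contiguous. If they are not, keeping the sorted list of passing depths would be the safer choice.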

The algorithm for realizing function (e) is as described for function (e) above.

The algorithm for realizing function (f) is as described for function (f) above.
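As a rough illustration of functions (e) and (f), the sketch below drops genotypes whose read depth lies outside a per-sample appropriate range and then gathers what remains, keyed by locus. The data layout (a dictionary from locus to a genotype and read depth pair), the sample identifiers, and the function names are assumptions made for this example only.

```python
def filter_by_depth(calls, depth_range):
    """Function (e): keep genotypes whose read depth is inside the range.

    calls: {locus: (genotype, read_depth)} for one DNA sample.
    """
    lo, hi = depth_range
    return {locus: gt for locus, (gt, depth) in calls.items()
            if lo <= depth <= hi}


def collect_remaining(samples, ranges):
    """Function (f): gather the surviving genotypes together with their locus.

    samples: {sample_id: calls}; ranges: {sample_id: (lo, hi)}.
    Returns {locus: {sample_id: genotype}}.
    """
    remaining = {}
    for sample_id, calls in samples.items():
        kept = filter_by_depth(calls, ranges[sample_id])
        for locus, gt in kept.items():
            remaining.setdefault(locus, {})[sample_id] = gt
    return remaining


# Hypothetical input: two samples, two loci, one shared depth range.
samples = {
    "S1": {("chr1", 100): (1, 30), ("chr1", 200): (2, 3)},
    "S2": {("chr1", 100): (1, 90), ("chr1", 200): (1, 25)},
}
ranges = {"S1": (10, 60), "S2": (10, 60)}
result = collect_remaining(samples, ranges)
print(result)
```

A locus appears in the result as soon as at least one sample retains a genotype there, which mirrors the "at least one DNA sample" wording of function (f).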

In addition to the algorithm for realizing functions (a) to (f), the computer program of the present invention can further include an algorithm that realizes a function of obtaining new second specific mutation information by removing, from the existing first specific mutation information, one or more pieces of mutation information defined as mutation information to be excluded through one or more functions selected from the following (α) to (δ).
(α) A function of defining a specific piece of mutation information as the mutation information to be excluded when the first specific mutation information is obtained using two or more DNA samples and the proportion of that specific mutation information having a read depth outside the appropriate range exceeds a threshold of 5% to 20% of the mutation information obtained by the above function (f).
(β) A function of extracting, from the primary mutation set information, all mutation information located in regions designated as low-complexity regions, calculating a measured value of genotype concordance for each low-complexity region between that mutation information and the mutation information of the secondary mutation set information, and defining all mutation information in low-complexity regions whose measured genotype concordance is at or below a predetermined threshold as the mutation information to be excluded.
(γ) A function of defining as the mutation information to be excluded, with respect to different primary mutation set information or different mutation information for each DNA sample generated using a mapping algorithm and a mutation call algorithm different from those used to generate the primary mutation set information used first, mutation information that is included in the mutation information first extracted by the above function (f) but is not included in the different primary mutation set information or the different mutation information.
(δ) A function of running an algorithm for the Hardy-Weinberg equilibrium test and defining mutation information deviating from Hardy-Weinberg equilibrium as the mutation information to be excluded.

The above additional functions (α) to (δ) correspond directly to "additional processes (α) to (δ)" in the computer system or apparatus of the present invention. That is, the computer program of the present invention contains an algorithm that, by realizing additional functions (α) to (δ) on a computer, causes the computer to perform processes (α) to (δ).

An example of an algorithm for realizing additional function (α) is one that extracts, for each DNA sample, the read depth of a specific piece of mutation information (such as SNV information) in the first specific mutation information, calculates the ratio of the number of DNA samples whose read depth is outside the appropriate range to the total number of DNA samples, determines whether the calculated value exceeds a specific threshold (any value from 5 to 20% as the lower limit), and, if it does, extracts that specific mutation information as "mutation information to be excluded".
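The per-site check described for additional function (α) might be sketched as below. The function name `sites_to_exclude`, the input layout, and the sample identifiers are hypothetical. The default threshold of 10% is one value within the stated 5 to 20% range, and whether the comparison is strict ("exceeds") or inclusive is treated here as an implementation choice.

```python
def sites_to_exclude(site_depths, ranges, max_fraction=0.10):
    """Return loci where too many samples have an out-of-range read depth.

    site_depths: {locus: {sample_id: read_depth}}
    ranges:      {sample_id: (lo, hi)} appropriate range per sample
    """
    excluded = set()
    for locus, depths in site_depths.items():
        if not depths:
            continue  # no samples observed at this locus
        out_of_range = sum(
            1 for sample_id, depth in depths.items()
            if not ranges[sample_id][0] <= depth <= ranges[sample_id][1]
        )
        # Strict comparison follows the "exceeds" wording of function (alpha).
        if out_of_range / len(depths) > max_fraction:
            excluded.add(locus)
    return excluded


# Hypothetical input: four samples sharing one appropriate range.
ranges = {s: (10, 60) for s in ("S1", "S2", "S3", "S4")}
site_depths = {
    ("chr1", 100): {"S1": 30, "S2": 30, "S3": 30, "S4": 30},
    ("chr1", 200): {"S1": 30, "S2": 5, "S3": 30, "S4": 30},
}
print(sites_to_exclude(site_depths, ranges))
```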

An example of an algorithm for realizing additional function (β) is one that extracts the mutation information in predetermined low-complexity regions from both the primary and the secondary mutation sets, compares the predetermined thresholds for "precision" and/or "detection rate" (as given above) with the values of "precision" and/or "detection rate" calculated and aggregated (averaged) for the subsets of mutation information partitioned by read depth, and extracts the mutation information whose "precision" and/or "detection rate" is at or below the threshold as "mutation information to be excluded", regardless of read depth.
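One possible reading of additional function (β) is sketched below: genotype agreement is computed per repeat class, and every site in a class that falls at or below a threshold is dropped. The function names, the record layout, and the use of a simple match fraction as the concordance measure are assumptions for illustration. The default of 0.997 is the value used in selection procedure 4 of the Examples.

```python
from collections import defaultdict


def low_confidence_classes(records, threshold=0.997):
    """Return repeat classes whose genotype match fraction is <= threshold.

    records: iterable of (repeat_class, ngs_genotype, array_genotype)
    for sites that are present on the SNP array.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for repeat_class, ngs, array in records:
        totals[repeat_class] += 1
        matches[repeat_class] += int(ngs == array)
    return {c for c in totals if matches[c] / totals[c] <= threshold}


def keep_sites(site_classes, bad_classes):
    """Keep loci that do not lie in any low-confidence repeat class.

    site_classes: {locus: repeat_class or None}
    """
    return {locus for locus, c in site_classes.items() if c not in bad_classes}


# Hypothetical input: "Alu" agrees at 3 of 4 sites, "L1" at 2 of 2.
records = [("Alu", 1, 1), ("Alu", 1, 1), ("Alu", 2, 2), ("Alu", 1, 2),
           ("L1", 1, 1), ("L1", 2, 2)]
bad = low_confidence_classes(records)
sites = {("chr1", 100): "Alu", ("chr1", 200): "L1", ("chr1", 300): None}
print(keep_sites(sites, bad))
```

Because the decision is made per class rather than per site, sites absent from the SNP array are still removed when their class fails the threshold.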

An example of an algorithm for realizing additional function (γ) is one that compares "different primary mutation set information or different mutation information for each DNA sample, generated using a mapping algorithm and a mutation call algorithm different from those used to generate the primary mutation set information used first" with "the mutation information first extracted by the above function (f)", identifies matches and mismatches, and extracts the mismatching mutation information as "mutation information to be excluded from the mutation information first extracted by the above function (f)".
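The comparison described for additional function (γ) amounts to a set difference between two call sets. In the minimal sketch below, each variant is represented by a hypothetical (chromosome, position, reference allele, alternate allele) key, and the function name is likewise an assumption.

```python
def split_by_second_pipeline(first_calls, second_calls):
    """Split the first call set by support from a second pipeline.

    Both arguments are sets of (chrom, pos, ref, alt) keys. Variants found
    only by the first pipeline are marked for exclusion.
    """
    to_exclude = first_calls - second_calls
    supported = first_calls - to_exclude
    return supported, to_exclude


# Hypothetical call sets from two different mapper and caller combinations.
first = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
second = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A")}
supported, to_exclude = split_by_second_pipeline(first, second)
print(supported, to_exclude)
```

Variants seen only by the second pipeline are ignored, since the text defines exclusion relative to the mutation information first extracted by function (f).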

The algorithm for realizing additional function (δ) is as described for additional function (δ) above.
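The text does not state which Hardy-Weinberg test statistic or significance level is used. As one common option, the sketch below applies a chi-square goodness-of-fit test with one degree of freedom to the genotype counts at a biallelic site. The function names and the cutoff `alpha` are hypothetical, and an exact test may be preferable when genotype counts are small.

```python
import math


def hwe_chi_square_p(n_ref_hom, n_het, n_alt_hom):
    """P-value of a 1-d.f. chi-square test for Hardy-Weinberg equilibrium."""
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        return 1.0
    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: the test is not informative
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref_hom, n_het, n_alt_hom)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def deviates_from_hwe(counts, alpha=1e-4):
    """True if the site should be treated as deviating from equilibrium.

    alpha is an assumed cutoff, not a value taken from the description.
    """
    return hwe_chi_square_p(*counts) < alpha


print(deviates_from_hwe((25, 50, 25)))  # counts exactly at equilibrium
print(deviates_from_hwe((50, 0, 50)))   # no heterozygotes at all
```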

The computer program of the present invention can be an algorithm for realizing on a computer not only the selection method of the present invention but also the depth determination method of the present invention. Specifically, it is a computer program for calculating the appropriate range of read depth from base sequence data, characterized in that it contains an algorithm that causes a computer to realize the following functions (a) to (d).
(a) A function of reading, from a recording unit, primary mutation set information to which a read depth is assigned, obtained on the basis of a mapping algorithm and a mutation call algorithm for each DNA sample of one or more individuals of the target organism, and secondary mutation set information obtained from each DNA sample by means other than mapping.
(b) A function of grouping the primary mutation set information read from the recording unit into subsets each having a predetermined range of read depth.
(c) A function of calculating a measured value of the genotype concordance of the mutation information of each subset by comparing the genotypes of the mutation information between the primary mutation set information and the secondary mutation set information read from the recording unit.
(d) A function of determining the appropriate range of read depth by merging the read depth ranges of the subsets whose measured genotype concordance is higher than a predetermined threshold.

The above functions (a) to (d) are substantially the same as (a) to (d) among the functions (a) to (f) in the computer program for executing the selection method of the present invention.

The computer program of the present invention can be written not only in a high-level language such as C, Java (registered trademark), Perl, or Python, but also in binary or in assembly language. The computer program is recorded, as needed or in advance, in the recording unit or in external hardware, and arithmetic processing according to the recorded algorithm is performed in the arithmetic processing unit as needed.

Examples of the present invention are described below.
[Whole-genome sequencing]
To prevent cross-contamination between samples, genomic DNA specimens were handled in 96-well plates during library construction. Genomic DNA was adjusted to a concentration of 20 ng/μL using a laboratory automation system (Biomek NXP, Beckman) and fragmented to an average of 550 bp using a 96-well plate DNA sonicator (Covaris LE220, Covaris). Libraries were then prepared from the resulting DNA with the TruSeq DNA PCR-Free HT sample prep kit (Illumina) on a Bravo liquid handling system (Agilent Technologies). Finally, the prepared libraries were transferred to 1.5 mL tubes identified by barcode labels and denatured for library quality control.

Library quantification and quality control were performed with the newly developed quantitative MiSeq method (qMiSeq). In this method, 8 μL or 10 μL of the prepared library is first denatured with an equal volume of 0.1 N NaOH for 5 minutes and then diluted with a 49-fold volume of ice-cold Illumina HT1 buffer. Next, 96 diluted libraries of 50 μL each, including three target samples previously analyzed by HiSeq, are pooled. Of this pool, 60 μL is diluted with 540 μL of ice-cold Illumina HT1 buffer and analyzed with the MiSeq 25 bp paired-end protocol. The index ratio determined from the MiSeq run is used as the relative concentration for setting the conditions of the HiSeq run. Details of the qMiSeq method have already been published as a methods paper [Katsuoka F, et al. An efficient quantitation method of next-generation sequencing libraries by using MiSeq sequencer. Anal. Biochem. 466c, 27-29 (2014)]. For quality control of the DNA libraries, Fragment Analyzer software (Advanced Analytical) was used in addition to the MiSeq QC.

The DNA libraries described above were sequenced on a HiSeq 2500 sequencer according to the manufacturer's protocol. Sequencing was run in 162 bp paired-end Rapid Run Mode using the TruSeq Rapid PE Cluster Kit (Illumina) and one and a half TruSeq Rapid SBS Kits (200 cycles, Illumina). Libraries were adjusted to an appropriate concentration based on the qMiSeq results, and on-board cluster generation (Illumina) was performed. The first reported cluster density was checked, and the decision whether to continue the run on the basis of that density (approximately 550 to 650 K/mm²) was made iteratively.

[Selection procedure 1] Alignment and mutation calling for each DNA sample
The sequence reads of each genomic DNA sample (hereinafter also referred to as a DNA sample) output from the HiSeq sequencer were aligned to the reference genome sequence (GRCh37/hg19) together with the decoy sequence (hs37d5). Alignment was performed with Bowtie2 (version 2.1.0) using the "-X 2000" option and with BWA-MEM (ver. 0.7.5a-r405) using the standard options. Mutation calling was performed on these alignments with Bcftools (ver. 0.1.17-dev) and the Genome Analysis Toolkit (GATK version 2.5-2) ("Procedure 1" in FIG. 1). For use in the subsequent filtering, the read depth at every SNV site of each DNA sample was calculated (bottom row of "Procedure 1" in FIG. 1). Note that the read depth here is the number of sequence reads aligned with mapping quality 5 or higher at each SNV position.
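The read depth definition used here (reads covering a position with mapping quality 5 or higher) can be illustrated with a small sketch. The tuple layout for an alignment and the function name `read_depth` are assumptions for the example, and treating each read as one ungapped interval is a simplification that ignores gaps and clipping in real alignments.

```python
def read_depth(alignments, chrom, pos, min_mapq=5):
    """Count reads covering (chrom, pos) with mapping quality >= min_mapq.

    alignments: iterable of (chrom, start, end, mapq) with 1-based,
    inclusive coordinates. Each read is treated as one ungapped interval.
    """
    return sum(
        1 for a_chrom, start, end, mapq in alignments
        if a_chrom == chrom and start <= pos <= end and mapq >= min_mapq
    )


# Hypothetical 162 bp reads around position chr1:150.
alignments = [
    ("chr1", 90, 251, 60),   # covers the site, high mapping quality
    ("chr1", 100, 261, 3),   # covers the site, but mapping quality below 5
    ("chr1", 300, 461, 60),  # does not cover the site
    ("chr2", 90, 251, 60),   # different chromosome
]
print(read_depth(alignments, "chr1", 150))
```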

Next, to evaluate the precision and the detection rate (power, equivalent to recall) at the sites designed on the SNP array, the SNV calls were compared with the genotype calls of the HumanOmni2.5-8 BeadChip, on which the same samples had been analyzed. The precision of the SNVs obtained by applying the Bcftools caller to the Bowtie2 alignments (Bowtie2+Bcftools) was consistently higher than that of the SNVs obtained by applying the GATK caller to the BWA-MEM alignments (BWA-MEM+GATK). The detection rate, on the other hand, was higher for BWA-MEM+GATK than for Bowtie2+Bcftools. This result is consistent with a previous report [Li H. Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843-2851 (2014)]. From these results, the SNVs obtained by "BWA-MEM+GATK" can be regarded as the output of a high-sensitivity approach, that is, one that aims to discover as many SNV candidates as possible. The SNVs obtained by "Bowtie2+Bcftools", in contrast, are highly reliable and are used as high-confidence SNV candidates in the filtering steps below. Unless otherwise noted, "SNV" hereinafter refers to a variant called by "Bowtie2+Bcftools".

[Selection procedure 2] SNV filtering based on read depth for each DNA sample
Genotypes with extremely high or low read depth were filtered out for each DNA sample. For each DNA sample, the range of read depth within which genotypes are retained was determined from the precision and detection rate calculated by comparison with the HumanOmni2.5-8 BeadChip genotyping results. SNVs were grouped by NGS read depth, and the precision and detection rate were calculated for each group (middle graph of "Procedure 2" in FIG. 1). To exclude the influence of reads aligned to multiple chromosomes, only reads with mapping quality 5 or higher were considered. As a result, genotypes whose read depth fell within the range covered by the groups with precision greater than 0.998 and detection rate greater than 0.95 were retained as reliable SNVs ("Procedure 2" in FIG. 1). In this selection procedure 2, 2% of the segregating sites were removed ("Procedure 2" in Table 2).

“Known mutations” in Table 2 are SNVs reported in “dbSNP build 138”.

[Selection Procedure 3] SNV filtering based on read depth for each SNV site
After the read-depth filtering of Selection Procedure 2 had been applied, any SNV for which genotypes were missing in 10% or more of all DNA samples was removed as an entire site. For example, sites at which the alignment of reads to repetitive sequences results in extremely high or low read depth can be subject to this filtering. The proportion of SNV sites remaining after this filtering was calculated (middle of “Procedure 3” in FIG. 1). Selection Procedure 3 removed 4.99% of the SNV sites (“Procedure 3” in Table 2 above).
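By way of non-limiting illustration, the site-level filter could be sketched as follows. The nested dictionary layout and the function name are assumptions made for this example; a genotype removed in Selection Procedure 2 is represented here by None or by the absence of the sample key.

```python
from typing import Dict, Hashable, List, Optional


def sites_to_keep(
    genotypes_by_site: Dict[Hashable, Dict[str, Optional[str]]],
    n_samples: int,
    max_missing_fraction: float = 0.10,
) -> List[Hashable]:
    """Keep sites whose fraction of missing genotypes is below the limit.

    genotypes_by_site maps a site to {sample name: genotype or None}.
    """
    kept = []
    for site, per_sample in genotypes_by_site.items():
        present = sum(1 for gt in per_sample.values() if gt is not None)
        missing = n_samples - present
        # Sites missing in 10% or more of the samples are dropped entirely.
        if missing / n_samples < max_missing_fraction:
            kept.append(site)
    return kept
```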

[Selection Procedure 4] Filter based on genome complexity
Accurate variant calling is generally difficult in regions of low sequence complexity, such as tandem repeats. The RepeatMasker program [Smit, AFA, Hubley, R & Green, P. RepeatMasker Open-4.0. 2013-2015 <http://www.repeatmasker.org>.] identifies 56% of the genome as such low-complexity regions. For each group of repeats identified by RepeatMasker, the genotype concordance with the SNP array genotyping of the same samples was calculated (table on the left of “Procedure 4” in FIG. 1). The concordance of the SNV calls in each genomic region was calculated on the basis of the genotype calls obtained in the analysis with the “HumanOmni2.5-8 BeadChip”, and the SNVs in genomic regions whose concordance was 0.997 or lower (right-hand graph of “Procedure 4” in FIG. 1) were removed. The low-complexity genomic regions include Alu, ERVK, Low_complexity, Satellite, Simple_repeat, TcMar-Mariner, CR1, DNA, Deu, Dong-R4, ERV, ERV1, ERVL, ERVL-MaLR, GENCODE, Gypsy, Helitron, L1, L2, LTR, MIR, Merlin, MuDR, Penelope, PiggyBac, RNA, RTE, RTE-BovB, SINE, TcMar, TcMar-Tc2, TcMar-Tigger, acro, centr, hAT, hAT-Blackjack, hAT-Charlie, hAT-Tip100, intergenic, rRNA, scRNA, snRNA, srpRNA, tRNA, telo, and the like. This procedure removed 14.22% of the SNVs (“Procedure 4” in Table 2 above).
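By way of non-limiting illustration, the per-region concordance filter could be sketched as follows. The tuple layouts and function names are assumptions made for this example, and the annotation of each SNV with a RepeatMasker class is assumed to have been done beforehand; only the 0.997 threshold comes from the description.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple


def unreliable_repeat_classes(
    comparisons: Iterable[Tuple[str, str, str]],
    threshold: float = 0.997,
) -> Set[str]:
    """Return the repeat classes whose genotype concordance is <= threshold.

    comparisons yields (repeat class, sequencing genotype, array genotype)
    for every SNV that is also genotyped on the SNP array.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for repeat_class, ngs_gt, array_gt in comparisons:
        totals[repeat_class] += 1
        if ngs_gt == array_gt:
            matches[repeat_class] += 1
    return {c for c in totals if matches[c] / totals[c] <= threshold}


def remove_unreliable(
    snvs: Iterable[Tuple[str, str]], bad_classes: Set[str]
) -> List[Tuple[str, str]]:
    """Drop every (SNV id, repeat class) pair whose class was flagged."""
    return [(snv, c) for snv, c in snvs if c not in bad_classes]
```

Note that the concordance is estimated only at array sites, while the removal applies to every SNV in a flagged class, including SNVs that are not on the array.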

[Selection Procedure 5] Filter against bias of the analysis method
To control for the bias of the analysis method used in Selection Procedure 1, only the SNV sites that were also found by a different analysis method were retained. Specifically, as described above, “BWA-MEM + GATK”, which performs high-sensitivity detection, was used as the “other method”, and SNVs that were not found by this “other method” were additionally excluded. For example, in the upper part of “Procedure 5” in FIG. 1, the leftmost SNV was not found by the other method (“BWA-MEM + GATK”), so this SNV was excluded from the subsequent analysis. This procedure removed 0.57% of the SNVs (“Procedure 5” in Table 2 above).
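By way of non-limiting illustration, this cross-method filter amounts to a set intersection and could be sketched as follows. The site key (chromosome, position, alternate allele) and the function name are assumptions made for this example; the description only states that SNV sites not found by the other method are excluded.

```python
from typing import Iterable, List, Tuple

SiteKey = Tuple[str, int, str]   # (chromosome, position, alternate allele); assumed key


def keep_sites_found_by_both(
    primary_sites: Iterable[SiteKey],
    other_method_sites: Iterable[SiteKey],
) -> List[SiteKey]:
    """Retain primary (high-confidence) sites also reported by the other method."""
    other = set(other_method_sites)
    # The order of the primary call set is preserved.
    return [site for site in primary_sites if site in other]
```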

[Selection Procedure 6] Population genetic filtering
SNVs whose p-value in the Hardy-Weinberg equilibrium test, calculated from the genotype frequencies of the SNVs obtained in Selection Procedure 5, was below 10⁻⁵ were removed. This filter is intended to remove SNVs whose genotype frequencies deviate from Hardy-Weinberg equilibrium. Many of the SNVs removed here are presumed to be artifacts caused by an incomplete reference sequence or by systematic alignment errors. This procedure removed 1.03% of the SNVs, and in the end 77.20% of the original SNVs remained after filtering (“Procedure 6” in FIG. 1 and “Procedure 6” in Table 2 above).
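By way of non-limiting illustration, a Hardy-Weinberg filter could be sketched as follows. The description does not state which form of the test was used; this example assumes a chi-square goodness-of-fit test with one degree of freedom, whose p-value equals erfc(sqrt(chi2 / 2)). An exact test is often preferred when genotype counts are small. The function names are hypothetical.

```python
import math


def hwe_chi2_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square (1 d.f.) p-value for deviation from Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0                    # monomorphic site, nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 d.f.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_hwe(n_aa: int, n_ab: int, n_bb: int, alpha: float = 1e-5) -> bool:
    """An SNV is removed when its p-value is below alpha (10^-5 in the text)."""
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= alpha
```

A site with genotype counts in exact Hardy-Weinberg proportions passes, whereas a site with no heterozygotes among many carriers of both alleles fails, which is the pattern expected from systematic alignment artifacts.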

Claims

A method for selecting gene mutation information by a computer, wherein the following steps (1) to (3) are executed:
(1) performing the following steps for each DNA sample:
(1-1) grouping primary mutation set information, to which read depths have been assigned for each DNA sample of one or more individuals of a target organism on the basis of a mapping algorithm and a mutation call algorithm, into subsets each having a predetermined range of read depth,
(1-2) calculating a measured value of genotype concordance for the mutations of each subset by comparing the genotypes of the mutations between the primary mutation set information and secondary mutation set information obtained from each DNA sample by means other than mapping,
(1-3) determining an appropriate range of read depth by combining the read depth ranges of the subsets whose measured value of genotype concordance is higher than a predetermined threshold;
(2) for each DNA sample, removing from the primary mutation set information of step (1-1) the mutations whose read depth is outside the appropriate range of read depth, thereby obtaining remaining mutation information for each DNA sample; and
(3) extracting the mutation information that remains as a result of performing steps (1) and (2) on at least one DNA sample among all DNA samples, together with the locus to which the mutation information belongs, and selecting it as target mutation information.

The method according to claim 1, wherein the gene mutation information includes gene mutation information on a single nucleotide variant (SNV), an insertion, or a deletion.

The method according to claim 1, wherein the target organism is a human.

The method according to any one of claims 1 to 3, wherein the means other than mapping for obtaining the secondary mutation set information is an SNP array.

The method according to any one of claims 1 to 4, wherein the measured value of genotype concordance is a precision and/or a detection rate (power).

The method according to any one of claims 1 to 5, wherein the predetermined threshold in step (1-3) is a precision of 0.995 or more.

The method according to any one of claims 1 to 5, wherein the predetermined threshold in step (1-3) is a precision of 0.995 or more and a detection rate of 0.93 or more.

The method according to any one of claims 1 to 7, wherein the mutation information obtained by performing steps (1) to (3) is taken as first specific mutation information, one or more pieces of mutation information defined as mutation information to be removed in one or more steps selected from the following (A) to (D) are removed from the first specific mutation information to obtain second specific mutation information, and the second specific mutation information is selected as the target mutation information:
(A) when the first specific mutation information is obtained by using two or more DNA samples and, at a mutation site of step (3), the ratio of specific mutation information having a read depth outside the appropriate range of read depth exceeds 5% to 20%, that specific mutation information is defined as mutation information to be removed;
(B) all mutation information present in regions assigned to low-complexity regions is selected from the primary mutation set information, a measured value of genotype concordance is calculated for each low-complexity region between that mutation information and the mutation information in the secondary mutation set information, and all mutation information in any low-complexity region whose measured value of genotype concordance is below a predetermined threshold is defined as mutation information to be removed;
(C) different primary mutation set information or different mutation information is obtained for each DNA sample by performing step (1), or steps (1) to (3), using a mapping algorithm and a mutation call algorithm different from those used to generate the primary mutation set information first used in step (1), and mutation information that is included in the first specific mutation information but not in the different primary mutation set information or the different mutation information is defined as mutation information to be removed;
(D) an algorithm for the Hardy-Weinberg equilibrium test is executed, and mutation information deviating from Hardy-Weinberg equilibrium is defined as mutation information to be removed.

The method according to claim 8, wherein, in step (A), the ratio of mutation information having a read depth outside the appropriate range of read depth exceeds 7% to 15%.

The method according to claim 8 or 9, wherein the secondary mutation set information in step (B) is obtained by using an SNP array.

The method according to any one of claims 8 to 10, wherein the measured value of genotype concordance in step (B) is a precision and/or a detection rate.

The method according to claim 11, wherein the predetermined threshold in step (B) is a precision of 0.995 or more.

The method according to claim 11, wherein the predetermined threshold in step (B) is a precision of 0.995 or more and a detection rate of 0.93 or more.

A computer system for selecting specific mutation information from base sequence data, comprising at least a recording unit and an arithmetic processing unit, characterized in that the following processes (a) to (f) are executed:
(a) the recording unit records primary mutation set information, to which read depths have been assigned for each DNA sample of one or more individuals of a target organism on the basis of a mapping algorithm and a mutation call algorithm, and secondary mutation set information obtained from each DNA sample by means other than mapping,
(b) the arithmetic processing unit executes a process of grouping the primary mutation set information read from the recording unit into subsets each having a predetermined range of read depth,
(c) a measured value of genotype concordance for the mutation information of each subset is calculated by executing a process of comparing the genotypes of the mutation information between the primary mutation set information and the secondary mutation set information read from the recording unit,
(d) a process of determining an appropriate range of read depth is executed by combining into one the read depth ranges of the subsets whose measured value of genotype concordance is higher than a predetermined threshold,
(e) a process of removing, from the primary mutation set information for each DNA sample, mutation information having a read depth outside the appropriate range of read depth determined in (d) is executed, and the remaining mutation information for each DNA sample is extracted, and
(f) a process is executed in which the mutation information remaining as a result of performing processes (a) to (e) on at least one DNA sample among all DNA samples is extracted together with the locus to which the mutation information belongs and is identified as target specific mutation information.

The computer system according to claim 14, wherein the specific mutation information obtained by the execution of processes (a) to (f) is taken as first specific mutation information, one or more pieces of mutation information defined as mutation information to be removed by one or more processes selected from the following (α) to (δ) are removed from the first specific mutation information to obtain second specific mutation information, and the second specific mutation information is extracted as the target mutation information:
(α) a process in which, when the first specific mutation information is obtained by using two or more DNA samples and the ratio of specific mutation information having a read depth outside the appropriate range of read depth exceeds 5% to 20% of the mutation information obtained by process (f), that specific mutation information is defined as mutation information to be removed;
(β) a process in which all mutation information present in regions assigned to low-complexity regions is extracted from the primary mutation set information, a measured value of genotype concordance is calculated for each low-complexity region between that mutation information and the mutation information in the secondary mutation set information, and all mutation information in any low-complexity region whose measured value of genotype concordance is below a predetermined threshold is defined as mutation information to be removed;
(γ) a process in which different primary mutation set information or different mutation information is generated for each DNA sample by using a mapping algorithm and a mutation call algorithm different from those used to generate the primary mutation set information first used, and mutation information that is included in the mutation information first extracted by process (f) of claim 14 but not in the different primary mutation set information or the different mutation information is defined as mutation information to be removed;
(δ) a process in which an algorithm for the Hardy-Weinberg equilibrium test is executed, and mutation information deviating from Hardy-Weinberg equilibrium is defined as mutation information to be removed.

A computer program for selecting specific mutation information from base sequence data, characterized by including an algorithm that implements the following functions (a) to (f):
(a) a function of reading information from a recording unit in which are recorded primary mutation set information, to which read depths have been assigned for each DNA sample of one or more individuals of a target organism on the basis of a mapping algorithm and a mutation call algorithm, and secondary mutation set information obtained from each DNA sample by means other than mapping,
(b) a function of executing a process of grouping the primary mutation set information read from the recording unit into subsets each having a predetermined range of read depth,
(c) a function of calculating a measured value of genotype concordance for the mutation information of each subset by executing a process of comparing the genotypes of the mutation information between the primary mutation set information and the secondary mutation set information read from the recording unit,
(d) a function of executing a process of determining an appropriate range of read depth by combining the read depth ranges of the subsets whose measured value of genotype concordance is higher than a predetermined threshold,
(e) a function of executing a process of removing, from the primary mutation set information for each DNA sample, mutation information having a read depth outside the appropriate range of read depth determined in (d), and extracting the remaining mutation information for each DNA sample, and
(f) a function of extracting the mutation information remaining as a result of executing functions (a) to (e) on at least one DNA sample among all DNA samples, together with the locus to which the mutation information belongs, and identifying it as target specific mutation information.

The computer program according to claim 16, further including, in addition to the algorithm that implements functions (a) to (f), an algorithm that implements a function of removing, from the existing first specific mutation information, one or more pieces of mutation information defined as mutation information to be removed by implementing one or more functions selected from the following (α) to (δ), thereby obtaining new second specific mutation information:
(α) a function of defining, when the first specific mutation information is obtained by using two or more DNA samples and the ratio of specific mutation information having a read depth outside the appropriate range of read depth exceeds 5% to 20% of the mutation information obtained by function (f), that specific mutation information as mutation information to be removed;
(β) a function of extracting from the primary mutation set information all mutation information present in regions assigned to low-complexity regions, calculating a measured value of genotype concordance for each low-complexity region between that mutation information and the mutation information in the secondary mutation set information, and defining all mutation information in any low-complexity region whose measured value of genotype concordance is below a predetermined threshold as mutation information to be removed;
(γ) a function of generating different primary mutation set information or different mutation information for each DNA sample by using a mapping algorithm and a mutation call algorithm different from those used to generate the primary mutation set information first used, and defining, as mutation information to be removed, mutation information that is included in the mutation information first extracted by function (f) but not in the different primary mutation set information or the different mutation information;
(δ) a function of executing an algorithm for the Hardy-Weinberg equilibrium test and defining mutation information deviating from Hardy-Weinberg equilibrium as mutation information to be removed.

A digital medium in which the computer program according to claim 16 or 17 is stored in a computer-readable state.

Mutation information obtained by the method according to claim 1.

A method for determining an appropriate range of read depth by a computer, comprising the following steps (1) to (3):
(1) grouping primary mutation set information, to which read depths have been assigned for each DNA sample of one or more individuals of a target organism on the basis of a mapping algorithm and a mutation call algorithm, into subsets each having a predetermined range of read depth,
(2) calculating a measured value of genotype concordance for the mutation information of each subset by comparing the genotypes of the mutation information between the primary mutation set information and secondary mutation set information obtained from each DNA sample by means other than mapping,
(3) determining the appropriate range of read depth by combining the read depth ranges of the subsets whose measured value of genotype concordance is higher than a predetermined threshold.

The method according to claim 20, wherein the measured value of genotype concordance is a precision and/or a detection rate.

A computer program for calculating an appropriate range of read depth from base sequence data, characterized by including an algorithm that implements the following functions (a) to (d):
(a) a function of reading information from a recording unit in which are recorded primary mutation set information, to which read depths have been assigned for each DNA sample of one or more individuals of a target organism on the basis of a mapping algorithm and a mutation call algorithm, and secondary mutation set information obtained from each DNA sample by means other than mapping,
(b) a function of executing a process of grouping the primary mutation set information read from the recording unit into subsets each having a predetermined range of read depth,
(c) a function of calculating a measured value of genotype concordance for the mutation information of each subset by executing a process of comparing the genotypes of the mutation information between the primary mutation set information and the secondary mutation set information read from the recording unit,
(d) a function of executing a process of determining an appropriate range of read depth by combining the read depth ranges of the subsets whose measured value of genotype concordance is higher than a predetermined threshold.

A digital medium in which the computer program according to claim 22 is stored in a computer-readable state.