JP2016077227A

JP2016077227A - Genomic-analysis apparatus, genomic-analysis method, and genomic-analysis program

Info

Publication number: JP2016077227A
Application number: JP2014212799A
Authority: JP
Inventors: Hitohiro Asano; Seiji Takashima; Atsuko Imai; Akihiro Nakatani
Original assignee: Osaka University NUC
Current assignee: Osaka University NUC
Priority date: 2014-10-17
Filing date: 2014-10-17
Publication date: 2016-05-16

Abstract

PROBLEM TO BE SOLVED: To provide a technique that enables the responsible mutation for a particular hereditary phenotype to be identified with high accuracy.
SOLUTION: A genome analysis apparatus according to one aspect of the invention comprises: a candidate mutation evaluation part that obtains the genome base sequences of a target organism having a particular hereditary phenotype and of a plurality of control organisms without that phenotype; calculates, for every partial genome base sequence containing a candidate mutation (a candidate for the responsible mutation of the hereditary phenotype), the degree of difference of mutation sites between the target organism and each control organism and the degree of difference of mutation sites among the control organisms; and tests whether the difference between the two calculated degrees of difference is significant, thereby calculating an evaluation value for evaluating the specificity of each candidate mutation; and an output control part that outputs the result of the test in a state in which the rank of each candidate mutation based on the evaluation values can be identified.
SELECTED DRAWING: Figure 4

Description

The present invention relates to a genome analysis apparatus, a genome analysis method, and a genome analysis program.

Realizing personalized medicine based on human genome information, and applying that information to medical research into the causes of disease, are priority issues in the big-data era now reaching the medical field. As a conventional method for identifying the cause of a disease, Non-Patent Document 1, for example, proposes linkage analysis, which identifies the cause of a disease by locating mutation sites that occur in common within a family sharing the same hereditary disease.

Non-Patent Document 2, for example, proposes a method in which known information, such as literature describing gene mutations that cause hereditary diseases (hereinafter also called “responsible mutations”), is collected in a mutation database. A plurality of candidate mutations that may underlie the disease in the target organism are then compared against that database to score each candidate mutation and identify the cause of the disease.

Non-Patent Document 1: G. M. Lathrop, J. M. Lalouel, C. Julier, J. Ott, "Strategies for multilocus linkage analysis in humans", Proc. Natl. Acad. Sci. USA, 1984, p.3443-3446
Non-Patent Document 2: Mark Yandell, Chad Huff, Hao Hu, et al., "A probabilistic disease-gene finder for personal genomes", Genome Res. 2011;21, p.1529-1542

In recent years, the advent of genome analysis technology based on next-generation sequencers has made it possible to identify mutation sites in genome base sequences in large numbers. Nevertheless, identifying the responsible mutation that causes a disease remains very difficult.

For example, with a linkage analysis method such as that of Non-Patent Document 1, it is difficult to identify the cause of a disease unless genome information can be obtained for a family over several generations. In today's society, with the trend toward nuclear families, declining birthrates, and later marriage, obtaining genome information over several generations is very difficult, so the situations in which linkage analysis can be used are expected to be very limited. A method based on comparison against known mutation database information, such as that of Non-Patent Document 2, copes poorly with unknown causes of disease, so it is difficult to identify new responsible mutations for hereditary diseases.

In other words, although attempts have been made to identify the cause of a disease from information obtained by genome analysis, conventional identification methods have the limitations described above, and identifying the cause of a disease is expected to remain difficult. This problem also applies to organisms other than humans.

In one aspect, the present invention has been made in view of these points. Its object is to provide a technique that enables the responsible mutation for a specific hereditary phenotype to be identified accurately without relying on case comparison within a hereditary family or on comparison against known mutation database information.

To solve the problems described above, the present invention adopts the following configurations.

That is, a genome analysis apparatus according to one aspect of the present invention comprises: a genome base sequence acquisition unit that acquires the genome base sequence of a target organism having a specific hereditary phenotype and the genome base sequences of a plurality of control organisms of the same species as the target organism that do not have the hereditary phenotype, each in a state in which mutation sites have been identified by comparison with a reference genome base sequence that serves as a reference for the genome base sequence of the target organism; a difference degree calculation unit that identifies, from the mutations identified by comparison with the reference genome base sequence, a plurality of candidate mutations designated as candidates for the responsible mutation of the hereditary phenotype of the target organism, designates a plurality of partial genome base sequences each containing at least one of the candidate mutations, and calculates, for each partial genome base sequence and based on the number and position information of the mutations it contains, a degree of difference indicating how much the partial genome base sequence differs between the target organism and each control organism and a degree of difference indicating how much it differs among the plurality of control organisms; a candidate mutation evaluation unit that tests, for each partial genome base sequence and based on a predetermined test method, whether there is a significant difference between the degree of difference of mutation sites between the target organism and each control organism and the degree of difference of mutation sites among the plurality of control organisms, and thereby calculates, for each partial genome base sequence, the statistic of the test as an evaluation value for evaluating the specificity of the candidate mutations contained in that partial genome base sequence; and an output control unit that outputs the result of the test in a state in which the rank of the candidate mutations based on the evaluation value can be identified.
The difference degree calculation unit is configured to change the size of each partial genome base sequence and recalculate, for each partial genome base sequence, the degree of difference of mutation sites between the target organism and each control organism and the degree of difference of mutation sites among the plurality of control organisms. The candidate mutation evaluation unit is configured to perform the test again using the recalculated degrees of difference, thereby recalculating the statistic of the test for each partial genome base sequence, and to select, for each partial genome base sequence, the statistic to be adopted as the evaluation value from among the statistics calculated for each change in size.

The genome analysis apparatus according to the above configuration acquires the genome base sequence of a target organism having a specific hereditary phenotype and the genome base sequences of a plurality of control organisms of the same species as the target organism that do not have that phenotype. A specific hereditary phenotype is a heritable characteristic that appears in the target organism, for example a hereditary disease. In each genome base sequence, the sites at which base mutations have occurred (mutation sites) have been identified by comparison with a reference genome base sequence that serves as a reference for the genome base sequence of the target organism.

Next, from the mutations identified by comparison with the reference genome base sequence, a plurality of candidate mutations that are candidates for the responsible mutation of the target organism's hereditary phenotype are identified, and a plurality of partial genome base sequences, each containing at least one candidate mutation, are designated. For each partial genome base sequence, the degree of difference of mutation sites between the target organism and each control organism and the degree of difference of mutation sites among the plurality of control organisms are calculated.

Further, based on a predetermined test method such as a t-test, a test of whether there is a significant difference between the degree of difference of mutation sites between the target organism and each control organism and the degree of difference of mutation sites among the plurality of control organisms is carried out for each partial genome base sequence. The statistic calculated by this per-sequence test is adopted as the evaluation value for evaluating the specificity of the candidate mutations contained in each partial genome base sequence. The result of the test is then output in a state in which the rank based on this evaluation value can be identified.
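As an illustration of the test step, the following minimal Python sketch computes a t statistic for each partial genome base sequence from its two sets of degrees of difference and then orders the sequences by that statistic. The source specifies only "a predetermined test method such as a t-test", so the Welch (unequal-variance) form, the function names `welch_t` and `rank_regions`, the region labels, the sample values, and the descending sort order are all assumptions made for this example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(case_vs_control, control_vs_control):
    """Welch t statistic (an assumed choice; the source says only 't-test etc.')."""
    n1, n2 = len(case_vs_control), len(control_vs_control)
    if n1 < 2 or n2 < 2:
        raise ValueError("each group needs at least two values")
    se = sqrt(variance(case_vs_control) / n1 + variance(control_vs_control) / n2)
    if se == 0.0:
        raise ValueError("both groups have zero variance")
    return (mean(case_vs_control) - mean(control_vs_control)) / se


def rank_regions(groups_by_region):
    """Return (region, statistic) pairs, largest statistic first (assumed order)."""
    stats = {r: welch_t(g1, g2) for r, (g1, g2) in groups_by_region.items()}
    return sorted(stats.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical degrees of difference for two partial genome base sequences:
# (target vs. each control, control vs. control).
example = {
    "region_A": ([0.8, 0.9, 1.0], [0.2, 0.3, 0.4]),
    "region_B": ([0.5, 0.6, 0.4], [0.5, 0.4, 0.6]),
}
for region, t in rank_regions(example):
    print(region, round(t, 3))
```

In this toy input, region_A separates the two groups clearly and is listed first, while region_B shows no separation.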

Here, in the processing that calculates the degrees of difference, the size of each partial genome base sequence is changed, and the degree of difference of mutation sites between the target organism and each control organism and the degree of difference of mutation sites among the plurality of control organisms are calculated again. The test is then performed again using the recalculated degrees of difference, so that the statistic of the test is recalculated for each partial genome base sequence. From the statistics calculated for each change in size, the statistic to be adopted as the evaluation value is selected for each partial genome base sequence.
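The size-variation loop can be sketched as follows. The source does not state how the window is placed around the candidate mutation or which of the per-size statistics is selected, so the symmetric window centred on the candidate, the choice of the largest statistic, and the names `scan_window_sizes` and `score` (a hypothetical callback returning the test statistic for a given start and end position) are assumptions for illustration only.

```python
def scan_window_sizes(candidate_pos, sizes, score):
    """Evaluate one candidate mutation over several window sizes.

    score(start, end) is a caller-supplied function (hypothetical) that
    returns the test statistic for the partial sequence [start, end].
    Returns (selected_size, selected_statistic).
    """
    results = {}
    for size in sizes:
        half = size // 2
        # Assumption: the window is centred on the candidate mutation.
        results[size] = score(candidate_pos - half, candidate_pos + half)
    # Assumption: the largest statistic is adopted as the evaluation value.
    best_size = max(results, key=results.get)
    return best_size, results[best_size]


# Toy scoring function that peaks for a 1000-base window.
def toy_score(start, end):
    return -abs((end - start) - 1000)


print(scan_window_sizes(50_000, [200, 1000, 5000], toy_score))
```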

Therefore, according to the above configuration, it is possible to evaluate, for each candidate mutation, whether that candidate mutation occurs specifically in the target organism while adjusting the size of the partial genome base sequence. This makes it possible to identify the responsible mutation for a specific hereditary phenotype accurately without relying on case comparison within a hereditary family or on comparison against known mutation database information.

As another form of the genome analysis apparatus according to the above aspect, the difference degree calculation unit may calculate a degree of difference of mutation sites defined as the ratio of the number of mutation sites found in only one of two organisms to the number of mutations in the union of the mutation sites of the two organisms within each partial genome base sequence. With this configuration, the degree of difference of mutation sites is defined as a ratio based on the so-called Hamming distance, so each candidate mutation can be evaluated with a simple calculation.
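The ratio defined above can be written as a short Python function. This is a minimal sketch: representing each organism's mutation sites as a set of integer positions, the name `degree_of_difference`, and returning 0.0 when neither organism has a mutation in the region are assumptions not stated in the source.

```python
def degree_of_difference(sites_a, sites_b):
    """Sites present in only one organism, divided by the size of the union."""
    a, b = set(sites_a), set(sites_b)
    union = a | b
    if not union:
        # Assumption: two regions without any mutation are treated as identical.
        return 0.0
    return len(a ^ b) / len(union)


# Hypothetical mutation positions within one partial genome base sequence.
target = {101, 250, 512, 777}
control = {101, 250, 600}
# Union has 5 sites; 512, 777 and 600 occur in only one organism: 3 / 5.
print(degree_of_difference(target, control))
```

The value is 0 when the two organisms share every mutation site in the region and 1 when they share none.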

As another form of the genome analysis apparatus according to the above aspect, the genome base sequence acquisition unit may acquire the genome base sequences of a plurality of control organisms that include organisms unrelated by blood to the target organism. With this configuration, the control organisms need not be blood relatives of the target organism, which increases the freedom in selecting control organisms.

As another form of the genome analysis apparatus according to the above aspect, the difference degree calculation unit may designate a partial genome base sequence by joining a plurality of partial genomic regions located apart from one another on the genome base sequence. With this configuration, a partial genome base sequence can be designated by joining separated partial genomic regions, which increases the freedom in designating the partial genome base sequence that serves as the range over which a candidate mutation is evaluated.
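One way to picture a partial genome base sequence built from separated regions is to collect the mutation sites that fall inside any of the joined regions. In this sketch the inclusive `(start, end)` tuples, the integer positions, and the name `sites_in_regions` are hypothetical; the source does not describe a data representation.

```python
def sites_in_regions(sites, regions):
    """Keep the mutation sites that lie in any of the joined (start, end) regions."""
    return {p for p in sites if any(start <= p <= end for start, end in regions)}


# Two regions far apart on the genome, treated as one partial sequence.
joined = [(100, 200), (5000, 5100)]
print(sorted(sites_in_regions({150, 300, 5050, 9000}, joined)))
```

The resulting set can then be compared between organisms in the same way as the sites of a single contiguous region.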

As another form of the genome analysis apparatus according to the above aspect, the genome base sequence acquisition unit may acquire, as the genome base sequences of the target organism and the control organisms, genome base sequences obtained by removing at least some intron regions from the whole genome region. Intron regions are regions that carry no genetic information, so they need not be considered when identifying the cause of a hereditary phenotype. With this configuration, removing at least some intron regions from the whole genome region reduces the amount of computation required for the genome analysis of the present invention.

As another form of the genome analysis apparatus according to the above aspect, the target organism and the plurality of control organisms may be humans, and the specific hereditary phenotype may be a hereditary disease. Further, the candidate mutation may be a mutation that is detected in the genome base sequence of the target organism and that is detected only at or below a certain frequency in a population of the same species. With this configuration, a genome analysis apparatus capable of identifying the responsible mutation that causes a hereditary disease can be constructed.
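Selecting candidate mutations by population frequency might look like the following sketch. The source says only "a certain frequency", so the 1% default threshold, the `population_freq` lookup table, the treatment of mutations absent from the table as frequency 0, and the function name `rare_candidates` are assumptions for illustration.

```python
def rare_candidates(target_sites, population_freq, max_freq=0.01):
    """Return the target's mutation sites whose population frequency is <= max_freq."""
    return sorted(
        pos for pos in target_sites
        # Assumption: a mutation missing from the table has never been observed.
        if population_freq.get(pos, 0.0) <= max_freq
    )


# Hypothetical frequencies of each mutation in a same-species population.
freq = {101: 0.35, 250: 0.004, 512: 0.01}
print(rare_candidates({101, 250, 512, 777}, freq))
```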

As another form of the genome analysis apparatus according to each of the above forms, the invention may be an information processing system that realizes each of the above configurations, an information processing method, a program, or a storage medium, readable by a computer or by another device or machine, on which such a program is recorded. Here, a recording medium readable by a computer or the like is a medium that stores information such as a program by electrical, magnetic, optical, mechanical, or chemical action. The information processing system may be realized by one or more information processing apparatuses.

For example, a genome analysis method according to one aspect of the present invention is an information processing method in which a computer executes: a genome base sequence acquisition step of acquiring the genome base sequence of a target organism having a specific hereditary phenotype and the genome base sequences of a plurality of control organisms of the same species as the target organism that do not have the hereditary phenotype, each in a state in which mutation sites have been identified by comparison with a reference genome base sequence that serves as a reference for the genome base sequence of the target organism; a difference degree calculation step of identifying, from the mutations identified by comparison with the reference genome base sequence, a plurality of candidate mutations designated as candidates for the responsible mutation of the hereditary phenotype of the target organism, designating a plurality of partial genome base sequences each containing at least one of the candidate mutations, and calculating, for each partial genome base sequence and based on the number and position information of the mutations it contains, a degree of difference indicating how much the partial genome base sequence differs between the target organism and each control organism and a degree of difference indicating how much it differs among the plurality of control organisms; a candidate mutation evaluation step of testing, for each partial genome base sequence and based on a predetermined test method, whether there is a significant difference between the degree of difference of mutation sites between the target organism and each control organism and the degree of difference of mutation sites among the plurality of control organisms, and thereby calculating, for each partial genome base sequence, the statistic of the test as an evaluation value for evaluating the specificity of the candidate mutations contained in that partial genome base sequence; and an output step of outputting the result of the test in a state in which the rank of the candidate mutations based on the evaluation value can be identified.
In the difference degree calculation step, the computer changes the size of each partial genome base sequence and recalculates, for each partial genome base sequence, the degree of difference of mutation sites between the target organism and each control organism and the degree of difference of mutation sites among the plurality of control organisms. In the candidate mutation evaluation step, the computer performs the test again using the recalculated degrees of difference, thereby recalculating the statistic of the test for each partial genome base sequence, and selects, for each partial genome base sequence, the statistic to be adopted as the evaluation value from among the statistics calculated for each change in size.

Also, for example, a genome analysis program according to one aspect of the present invention is a program that causes a computer to execute: a genome base sequence acquisition step of acquiring the genome base sequence of a target organism having a specific hereditary phenotype and the genome base sequences of a plurality of control organisms of the same species as the target organism that do not have the hereditary phenotype, each in a state in which mutation sites have been identified by comparison with a reference genome base sequence that serves as a reference for the genome base sequence of the target organism; a difference degree calculation step of identifying, from the mutations identified by comparison with the reference genome base sequence, a plurality of candidate mutations designated as candidates for the responsible mutation of the hereditary phenotype of the target organism, designating a plurality of partial genome base sequences each containing at least one of the candidate mutations, and calculating, for each partial genome base sequence and based on the number and position information of the mutations it contains, a degree of difference indicating how much the partial genome base sequence differs between the target organism and each control organism and a degree of difference indicating how much it differs among the plurality of control organisms; a candidate mutation evaluation step of testing, for each partial genome base sequence and based on a predetermined test method, whether there is a significant difference between the degree of difference of mutation sites between the target organism and each control organism and the degree of difference of mutation sites among the plurality of control organisms, and thereby calculating, for each partial genome base sequence, the statistic of the test as an evaluation value for evaluating the specificity of the candidate mutations contained in that partial genome base sequence; and an output step of outputting the result of the test in a state in which the rank of the candidate mutations based on the evaluation value can be identified.
In the difference degree calculation step, the program causes the computer to change the size of each partial genome base sequence and recalculate, for each partial genome base sequence, the degree of difference of mutation sites between the target organism and each control organism and the degree of difference of mutation sites among the plurality of control organisms. In the candidate mutation evaluation step, the program causes the computer to perform the test again using the recalculated degrees of difference, thereby recalculating the statistic of the test for each partial genome base sequence, and to select, for each partial genome base sequence, the statistic to be adopted as the evaluation value from among the statistics calculated for each change in size.

According to the present invention, the responsible mutation for a specific hereditary phenotype can be identified accurately without relying on case comparison within a hereditary family or on comparison against known mutation database information.

FIG. 1 shows an example of a situation in which the present invention is applied.
FIG. 2 illustrates the hardware configuration of the genome analysis apparatus according to the embodiment.
FIG. 3 illustrates the functional configuration of the genome analysis apparatus according to the embodiment.
FIG. 4 illustrates a processing procedure for genome analysis according to the embodiment.
FIG. 5 illustrates a method for comparing mutation sites in a partial genome base sequence according to the embodiment.
FIG. 6 shows an example of a method for designating a partial genome base sequence.
FIG. 7 shows an example of calculating the degree of difference of mutation sites between a patient and a healthy person.
FIG. 8A shows a family in which the cause of hypertrophic cardiomyopathy is to be identified.
FIG. 8B shows a family in which the cause of hypertrophic cardiomyopathy is to be identified.
FIG. 8C shows a family in which the cause of restrictive cardiomyopathy is to be identified.
FIG. 9 shows a processing result of the genome analysis apparatus according to the embodiment.

Hereinafter, an embodiment according to one aspect of the present invention (hereinafter also referred to as “the present embodiment”) will be described with reference to the drawings. The present embodiment described below is, in every respect, merely an illustration of the present invention. It goes without saying that various improvements and modifications can be made without departing from the scope of the present invention. That is, in implementing the present invention, a specific configuration suited to the embodiment may be adopted as appropriate. Although the data appearing in the present embodiment are described in natural language, more specifically they are specified in a pseudo language, commands, parameters, machine language, or the like that a computer can recognize.

§1 Application scene
First, a situation in which the present invention is applied will be described with reference to FIG. 1. FIG. 1 illustrates a situation in which the genome analysis apparatus 1 according to the present embodiment is used. The genome analysis apparatus 1 according to the present embodiment is an information processing apparatus that analyzes the genome base sequence of a patient (case) suffering from a hereditary disease and the genome base sequences of healthy persons (controls) not suffering from that disease, and scores each candidate mutation that may be the cause of the disease, thereby making it possible to identify the responsible mutation that causes the disease.

Specifically, the genome analysis apparatus 1 according to the present embodiment acquires, for the patient and for each of a plurality of healthy persons, a genome base sequence in which the mutation sites have been identified. The mutation sites in each genome base sequence can be identified by comparing that sequence with a reference genome base sequence such as hg19.

Subsequently, the genome analysis apparatus 1 identifies, from among the mutations occurring in the patient's genome base sequence, a plurality of candidate mutations that are candidates for the responsible mutation, and designates a plurality of partial genome base sequences each containing at least one candidate mutation. Then, for each designated partial genome base sequence, the genome analysis apparatus 1 calculates a degree of difference indicating how the mutation sites differ between the patient and a healthy subject, and stores the calculated degrees of difference in a first group. Similarly, for each designated partial genome base sequence, the genome analysis apparatus 1 calculates a degree of difference indicating how the mutation sites differ between healthy subjects, and stores the calculated degrees of difference in a second group.
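For illustration only, the construction of the two groups for one partial genome base sequence can be sketched as follows. This is a minimal sketch, not the implementation of the embodiment: the function name `build_groups`, the representation of each individual as a set of mutation positions, and the toy measure `symmetric_difference_size` are all assumptions introduced here.

```python
from itertools import combinations

def symmetric_difference_size(sites_a, sites_b):
    """Toy degree of difference (an assumption): number of mutation
    positions present in exactly one of the two individuals."""
    return len(set(sites_a) ^ set(sites_b))

def build_groups(patient_sites, control_sites_list, degree_of_difference):
    """Return (first_group, second_group) for one partial genome base sequence.

    first_group : patient versus each control
    second_group: every unordered pair of controls
    """
    first_group = [degree_of_difference(patient_sites, c)
                   for c in control_sites_list]
    second_group = [degree_of_difference(a, b)
                    for a, b in combinations(control_sites_list, 2)]
    return first_group, second_group

# Hypothetical mutation positions inside one window.
patient = {1, 2, 3}
controls = [{1}, {1, 4}, {1}]
group1, group2 = build_groups(patient, controls, symmetric_difference_size)
```

With n controls, the first group holds n values and the second group holds n(n-1)/2 values, so the two groups generally differ in size.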

Furthermore, the genome analysis apparatus 1 tests, for each partial genome base sequence, whether there is a significant difference between the first group and the second group, based on a predetermined test method such as a t-test. The statistic calculated for each partial genome base sequence by this test indicates whether the way mutation sites differ between the patient and the healthy subjects is significantly different from the way mutation sites differ among the healthy subjects. This statistic can therefore be used as an evaluation value for evaluating whether each candidate mutation occurs specifically in the patient.
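As a hedged illustration of such a test, the sketch below computes a two-sample t statistic from the two groups. The text names only “a t-test”; the choice of Welch's unequal-variance form, the function name `welch_t`, and the sample values are assumptions made for this example.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group1, group2):
    """Welch's t statistic for two independent samples.

    Each group needs at least two values, and the combined standard
    error must be non-zero (otherwise the statistic is undefined).
    """
    n1, n2 = len(group1), len(group2)
    standard_error = sqrt(variance(group1) / n1 + variance(group2) / n2)
    return (mean(group1) - mean(group2)) / standard_error

# Hypothetical degrees of difference for one partial genome base sequence.
first_group = [0.90, 0.80, 0.85, 0.95]               # patient vs. controls
second_group = [0.20, 0.30, 0.25, 0.35, 0.30, 0.20]  # control vs. control
t_value = welch_t(first_group, second_group)
```

A large positive value suggests that the patient differs from the controls more than the controls differ from one another within that partial genome base sequence.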

Here, a candidate mutation that occurs specifically in the patient is assumed to be highly likely to be the responsible mutation for the hereditary disease affecting the patient. The genome analysis apparatus 1 therefore outputs the result of the test in a state in which the rank of each candidate mutation based on the evaluation value can be identified. A user who sees this result can thus identify the candidate mutations that are most likely to be the responsible mutation for the patient's hereditary disease, and can proceed to identify the responsible mutation starting from the highest-ranked candidate mutations.

However, each evaluation value may depend on the size of the corresponding partial genome base sequence, that is, on the size of the range used to evaluate each candidate mutation. Depending on how the size of each partial genome base sequence is set, a candidate mutation may not be evaluated correctly. Therefore, in the processing for calculating the degrees of difference, the genome analysis apparatus 1 according to the present embodiment changes the size of each partial genome base sequence and recalculates the degree of difference in mutation sites between the patient and the healthy subjects and the degree of difference in mutation sites among the healthy subjects. The genome analysis apparatus 1 then re-executes the test using both recalculated degrees of difference, thereby recalculating the test statistic for each partial genome base sequence, and is configured to be able to select, for each partial genome base sequence, the statistic to be adopted as the evaluation value from among the statistics calculated for each size.
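The size sweep described above might be sketched as follows. This passage does not state the rule for choosing among the statistics, so selecting the statistic with the largest absolute value is purely an assumption of this sketch, as are the name `select_evaluation_value` and the toy numbers.

```python
def select_evaluation_value(statistic_for_size, sizes):
    """Evaluate the test statistic at every window size and return the
    (size, statistic) pair with the largest absolute statistic.

    statistic_for_size: callable mapping a window size (number of bases)
    to the test statistic recomputed for that size.
    """
    results = [(size, statistic_for_size(size)) for size in sizes]
    return max(results, key=lambda pair: abs(pair[1]))

# Hypothetical statistics recomputed for three window sizes.
toy_statistics = {10_000: 1.2, 30_000: 3.4, 50_000: 2.1}
best_size, best_statistic = select_evaluation_value(
    toy_statistics.__getitem__, [10_000, 30_000, 50_000])
```

Passing the statistic as a callable keeps the sweep independent of how the degrees of difference and the test are actually computed.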

Thus, according to the present embodiment, each candidate mutation can be evaluated as to whether it occurs specifically in the patient while the size of the partial genome base sequence, that is, the number of bases it contains, is adjusted. This makes it possible to identify the responsible mutation causing a hereditary disease with high accuracy without relying on case comparison within an affected family or on comparison against known mutation databases.

In the present embodiment, as described above, an example is described in which the present invention is applied to processing for identifying the responsible mutation of a hereditary disease. The “hereditary disease” corresponds to an example of the “specific hereditary phenotype” of the present invention. Here, a “phenotype” refers to a characteristic in which the genetic makeup of an organism is expressed as a trait. The “patient” corresponds to an example of the “target organism” of the present invention, and the “healthy subject” corresponds to an example of the “control organism” of the present invention. However, the applicable scope of the present invention is not limited to processing for identifying the responsible mutation of a hereditary disease; the present invention is widely applicable to processing for identifying a responsible mutation that causes a specific hereditary phenotype.

That is, the characteristic that can be adopted as the “specific hereditary phenotype” of the present invention is not limited to a hereditary disease and can be selected as appropriate according to the embodiment. Besides hereditary diseases, the “specific hereditary phenotype” can be selected as appropriate from phenotypes that an animal or plant has maintained over several generations. Also, in the present embodiment, the species of the test organism having a specific hereditary disease (target organism) and of the test organisms not having the hereditary disease (control organisms) may be an animal other than a human, or may be a plant. The species of the “target organism” and the “control organism” of the present invention can be selected as appropriate according to the embodiment. However, the “target organism” and the “control organisms” are set to organisms of the same species.

§2 Configuration example
＜Hardware configuration＞
Next, the hardware configuration of the genome analysis apparatus 1 will be described with reference to FIG. 2. FIG. 2 illustrates the hardware configuration of the genome analysis apparatus 1 according to the present embodiment. As illustrated in FIG. 2, the genome analysis apparatus 1 is a computer in which the following are electrically connected: a control unit 11 including a CPU, a RAM (Random Access Memory), a ROM (Read Only Memory), and the like; a storage unit 12 that stores a program 5 executed by the control unit 11 and the like; a display device 13 for displaying images, such as a liquid crystal display; an input device 14 for performing input, such as a mouse and a keyboard; an external interface 15 for connecting to an external device; a communication interface 16 for communicating via a network; and a drive 17 for reading a program stored in a storage medium 6. In FIG. 2, the communication interface and the external interface are labeled “communication I/F” and “external I/F”, respectively.

Regarding the specific hardware configuration of the genome analysis apparatus 1, components can be omitted, replaced, and added as appropriate according to the embodiment. For example, the control unit 11 may include a plurality of processors. The display device 13 and the input device 14 may be replaced with a touch panel display. Furthermore, the genome analysis apparatus 1 may include a plurality of external interfaces 15 and may be connected to a plurality of external devices.

The program 5 stored in the storage unit 12 is a program for causing the genome analysis apparatus 1 to execute each process related to genome analysis described later, and corresponds to the “program” of the present invention. The program 5 may be recorded on the storage medium 6.

The storage medium 6 is a medium that accumulates information such as a program by electrical, magnetic, optical, mechanical, or chemical action so that a computer or other device, machine, or the like can read the recorded information such as the program. The storage medium 6 corresponds to the “storage medium” of the present invention.

Here, FIG. 2 illustrates a disc-type storage medium such as a CD (Compact Disc) or a DVD (Digital Versatile Disc) as an example of the storage medium 6. However, the type of the storage medium 6 is not limited to the disc type and may be other than the disc type. An example of a storage medium other than the disc type is a semiconductor memory such as a flash memory.

As the genome analysis apparatus 1, a general-purpose apparatus such as a PC (Personal Computer) or a tablet terminal may be used, in addition to an apparatus designed exclusively for the service to be provided. Furthermore, the genome analysis apparatus 1 may be implemented by one or a plurality of computers.

＜Functional configuration example＞
Next, the functional configuration of the genome analysis apparatus 1 will be described with reference to FIG. 3. FIG. 3 illustrates the functional configuration of the genome analysis apparatus 1 according to the present embodiment. In the present embodiment, the control unit 11 of the genome analysis apparatus 1 loads the program 5 stored in the storage unit 12 into the RAM. The control unit 11 then interprets and executes the program 5 loaded into the RAM by means of the CPU, and controls each component. The genome analysis apparatus 1 thereby functions as a computer including a genome base sequence acquisition unit 21, a degree-of-difference calculation unit 22, a candidate mutation evaluation unit 23, and an output control unit 24.

The genome base sequence acquisition unit 21 acquires the genome base sequence of a target organism having a specific hereditary phenotype and the genome base sequences of a plurality of control organisms not having that hereditary phenotype, each in a state in which mutation sites have been identified by comparison with a reference genome base sequence that serves as a reference for the genome base sequence of the target organism. In the present embodiment, the specific hereditary phenotype is a hereditary disease, and the target organism and the control organisms are humans. The reference organism for the reference genome base sequence is also human, that is, of the same species as the target organism and the control organisms. The reference genome base sequence may be, for example, a known genome base sequence such as hg19, or may be a genome base sequence set as appropriate by the user.

Next, the degree-of-difference calculation unit 22 identifies, from the mutations identified by comparison with the reference genome base sequence, a plurality of candidate mutations designated as candidates for the responsible mutation of the hereditary phenotype of the target organism, and designates a plurality of partial genome base sequences each containing at least one candidate mutation. Then, based on the number and position information of the mutations contained in each partial genome base sequence, the degree-of-difference calculation unit 22 calculates, for each partial genome base sequence, a degree of difference indicating how the mutation sites in that partial genome base sequence differ between the target organism and each control organism, and a degree of difference indicating how the mutation sites in that partial genome base sequence differ among the plurality of control organisms.

Furthermore, based on a predetermined test method such as a t-test, the candidate mutation evaluation unit 23 tests, for each partial genome base sequence, whether there is a significant difference between the degree of difference in mutation sites between the target organism and each control organism and the degree of difference in mutation sites among the plurality of control organisms. The candidate mutation evaluation unit 23 thereby calculates, for each partial genome base sequence, the test statistic as an evaluation value for evaluating the specificity of the candidate mutation contained in that partial genome base sequence.

The output control unit 24 then outputs the result of the test in a state in which the rank of each candidate mutation based on the evaluation value can be identified. As such a state, the output control unit 24 may display the candidate mutations sorted in descending order of evaluation value, or may display the rank based on the evaluation value in association with each candidate mutation. The display form can be set as appropriate.
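As an illustrative sketch of the first display form (descending order of evaluation value, with the rank attached), the following may help. The function name `rank_candidates` and the candidate identifiers are hypothetical and are not taken from the embodiment.

```python
def rank_candidates(evaluation_values):
    """Return a list of (rank, candidate, value) tuples sorted in
    descending order of evaluation value; rank 1 is the highest value.

    evaluation_values: dict mapping a candidate identifier to its
    evaluation value (test statistic).
    """
    ordered = sorted(evaluation_values.items(),
                     key=lambda item: item[1], reverse=True)
    return [(rank, candidate, value)
            for rank, (candidate, value) in enumerate(ordered, start=1)]

# Hypothetical candidate identifiers and evaluation values.
ranking = rank_candidates({"candidate_A": 2.5,
                           "candidate_B": 7.1,
                           "candidate_C": 0.4})
for rank, candidate, value in ranking:
    print(f"{rank}\t{candidate}\t{value:.2f}")
```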

Here, the degree-of-difference calculation unit 22 changes the size of each partial genome base sequence and recalculates, for each partial genome base sequence, the degree of difference in mutation sites between the target organism and each control organism and the degree of difference in mutation sites among the plurality of control organisms. The candidate mutation evaluation unit 23 then re-executes the above test using the recalculated degrees of difference, thereby recalculating the test statistic for each partial genome base sequence, and selects, for each partial genome base sequence, the statistic to be adopted as the evaluation value from among the statistics calculated for each size. This makes it possible to evaluate whether a candidate mutation occurs specifically in the target organism while adjusting the size of the partial genome base sequence.

In the present embodiment, an example is described in which all of these functions are realized by a general-purpose CPU. However, some or all of these functions may be realized by one or more dedicated processors. Regarding the functional configuration of the genome analysis apparatus 1, functions may be omitted, replaced, and added as appropriate according to the embodiment. Each function is described in detail in the operation example below.

§3 Operation example
Next, an operation example of the genome analysis apparatus 1 will be described with reference to FIGS. 4 to 7. FIG. 4 illustrates a processing procedure for genome analysis by the genome analysis apparatus 1. The processing procedure described below is merely an example, and each process may be changed to the extent possible. In the processing procedure described below, steps can be omitted, replaced, and added as appropriate according to the embodiment.

(Step S101)
In step S101, the control unit 11 functions as the genome base sequence acquisition unit 21 and acquires, for the patient and for each of a plurality of healthy subjects, a genome base sequence in which mutation sites have been identified. After acquiring the genome base sequences, the control unit 11 advances the processing to the next step S102.

The patient (case) is a person suffering from a hereditary disease, and a healthy subject (control) is a person not suffering from the hereditary disease that the patient has. A healthy subject may be a blood relative of the patient or may be unrelated. The genome base sequences of the patient and the healthy subjects can be analyzed with a next-generation sequencer such as HiSeq/MiSeq (Illumina) or Genome Sequencer FLX+ System (Roche). Each genome base sequence includes information on the bases constituting the genome and position information indicating the position (base number) of each base in the sequence.

The control unit 11 acquires, for example, the genome base sequences of the patient and of the healthy subjects as analyzed by a next-generation sequencer or the like. For the healthy subjects, the control unit 11 acquires the genome base sequences of a plurality of persons. Each genome base sequence may be acquired from the storage unit 12 in the genome analysis apparatus 1 or from a storage medium loaded into the drive 17, or may be acquired from another information processing apparatus via a network.

The control unit 11 also acquires a reference genome base sequence that serves as a reference for the human genome base sequence. Like each genome base sequence, the reference genome base sequence may be acquired from the storage unit 12 in the genome analysis apparatus 1 or from a storage medium loaded into the drive 17, or may be acquired from another information processing apparatus via a network. The reference genome base sequence may be, for example, a known genome base sequence such as hg19, or may be a genome base sequence set as appropriate by the user, provided that it is not derived from one of the subjects.

The control unit 11 then compares the genome base sequence of the patient and of each healthy subject with the reference genome base sequence, and identifies the sites (mutation sites) at which a single nucleotide variant (SNV) occurs in each genome base sequence. Specifically, the control unit 11 determines whether the bases located at the same position (base number) in each genome base sequence and in the reference genome base sequence match. When it determines that the bases at the same position (base number) do not match, the control unit 11 recognizes that position (base number) as a mutation site at which a single nucleotide variant occurs. When the bases at the same position (base number) match, the control unit 11 recognizes that position (base number) as a site at which no mutation occurs. In this way, the control unit 11 can acquire, for the patient and for each of the plurality of healthy subjects, a genome base sequence in which mutation sites have been identified.
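The position-by-position comparison described above can be illustrated with a minimal sketch. It assumes that both sequences are plain strings already aligned base for base and that base numbers start at 1; the function name `find_snv_positions` and the toy sequences are hypothetical.

```python
def find_snv_positions(sample_sequence, reference_sequence):
    """Return the 1-based positions (base numbers) at which the sample
    base differs from the reference base at the same position."""
    if len(sample_sequence) != len(reference_sequence):
        raise ValueError("sequences must cover the same positions")
    return [position
            for position, (sample_base, reference_base)
            in enumerate(zip(sample_sequence, reference_sequence), start=1)
            if sample_base != reference_base]

# Toy example: positions 3 and 6 differ from the reference.
snv_positions = find_snv_positions("ACGTAC", "ACCTAA")
```

Real sequencing data would normally go through read mapping and variant calling first; this sketch covers only the final base-by-base comparison.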

The comparison with the reference genome base sequence may be executed by another information processing apparatus. That is, the control unit 11 may omit the comparison with the reference genome base sequence and acquire, for the patient and for each of the plurality of healthy subjects, a genome base sequence in which mutation sites have been identified in advance.

The whole genome region includes intergenic regions and intragenic regions (intron regions and exon regions). Of these, intron regions do not encode genetic information and are unlikely to contain the responsible mutation causing a hereditary disease, so they may be excluded when identifying the responsible mutation. Similarly, intergenic regions are unlikely to contain the responsible mutation causing a hereditary disease, so they may also be excluded when identifying the responsible mutation.

Therefore, each genome base sequence acquired in step S101 need not correspond to the whole genome region, and may be one obtained by removing at least some of the intron regions and/or intergenic regions from the whole genome region. Each acquired genome base sequence may even correspond to the whole exon region, obtained by removing all intron regions and intergenic regions from the whole genome region. Because this limits the size of each genome base sequence to be analyzed, the amount of computation required for the series of processes in this operation example can be reduced.

The whole exon region may include the regions neighboring each exon region. A neighboring region is, for example, a region containing several to several tens of bases from the end of each exon region. Also, hg19, an example of the reference genome base sequence above, corresponds to the whole genome region. Therefore, when hg19 is used as the reference genome base sequence and intron regions and the like are excluded in step S101, the control unit 11 also excludes the intron regions and the like from hg19.

(Step S102)
In the next step S102, the control unit 11 functions as the degree-of-difference calculation unit 22, designates the size of each partial genome base sequence containing a candidate mutation, and calculates, for each partial genome base sequence, the degree of difference in the mutation sites occurring within that partial genome base sequence between two individuals (between the patient and a healthy subject, or between one healthy subject and another). When it has finished calculating the degrees of difference, the control unit 11 advances the processing to the next step S103.

FIG. 5 illustrates a scene in which the mutation sites occurring within each partial genome base sequence are compared between two individuals (between the patient and a healthy subject, or between one healthy subject and another).

First, the control unit 11 identifies candidate mutations from among the mutations detected in the patient's genome base sequence. As illustrated in FIG. 5, the candidate mutations are designated, as candidates for the responsible mutation of the hereditary disease, from among the single nucleotide variants detected in the patient's genome base sequence. Candidate mutations can be designated as appropriate; for example, rare mutations, that is, mutations detected in the patient's genome base sequence that are found in the human population only at or below a certain frequency, may be adopted as candidate mutations.

Such rare mutations can be identified as follows. That is, rare mutations can be identified by excluding the following from the single nucleotide variants obtained by comparison with the reference genome base sequence:
1) mutations whose sequence quality value is at or below a predetermined value
2) mutations included in a region with strand bias
3) mutations that do not cause a codon change
4) known single nucleotide polymorphisms (SNPs) found in healthy individuals
In general, comparison with a reference genome base sequence identifies about 1.5 million single nucleotide variants in a patient's genome base sequence. By excluding 1) to 4) above, the mutations found in the patient's genome base sequence can be narrowed down to about 600 rare mutations.
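The four exclusion criteria can be expressed as a simple filter. The sketch below is illustrative only: the `Variant` record, its field names, and the quality threshold are hypothetical, and how each flag is obtained (quality scoring, strand-bias detection, codon annotation, SNP database lookup) is outside the scope of the sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    position: int                 # base number of the single nucleotide variant
    quality: float                # sequence quality value
    in_strand_bias_region: bool   # criterion 2
    changes_codon: bool           # criterion 3 (False means no codon change)
    is_known_snp: bool            # criterion 4

def is_rare_mutation(variant, quality_threshold):
    """Keep a variant only if none of exclusion criteria 1) to 4) applies."""
    return (variant.quality > quality_threshold      # 1) excluded if at or below
            and not variant.in_strand_bias_region    # 2)
            and variant.changes_codon                # 3)
            and not variant.is_known_snp)            # 4)

variants = [
    Variant(101, 50.0, False, True, False),   # passes all four criteria
    Variant(202, 10.0, False, True, False),   # low quality
    Variant(303, 50.0, True, True, False),    # strand bias
    Variant(404, 50.0, False, False, False),  # no codon change
    Variant(505, 50.0, False, True, True),    # known SNP
]
rare = [v for v in variants if is_rare_mutation(v, quality_threshold=20.0)]
```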

Strand bias refers to a state, arising in paired-end analysis, in which sequence tags exist for only one strand. Paired-end analysis is a sequence analysis method provided by Illumina in which only the two ends of a partial sequence of a certain length are sequenced. The sequenced sequences (tags) are normally mapped, with reference to the reference genome base sequence, to the portions with high sequence homology. However, by chance or for some other reason, only one of the two end sequences of a tag may be mapped. In a region where only one of the two end sequences of such tags is mapped, in other words a region with strand bias, some kind of error may have occurred. In 2) above, therefore, the control unit 11 excludes mutations contained in regions with strand bias in order to rule out that possibility.

Next, the control unit 11 designates, in the patient's genome base sequence, a plurality of partial genome base sequences each containing at least one candidate mutation. For example, as illustrated in FIG. 5, the control unit 11 designates the range of each partial genome base sequence so that the candidate mutation is located at its center and it contains 30K (30,000) bases. Because the genome base sequence holds the position information of each base, the range of each partial genome base sequence can be designated by the position information of the bases at its two ends. The control unit 11 then designates, in the genome base sequence of each healthy subject, the region at the same position and with the same range as each partial genome base sequence designated in the patient's genome base sequence as the corresponding partial genome base sequence.
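A window of this kind can be computed from the candidate position alone. The following is a minimal sketch under stated assumptions: positions are 1-based and inclusive, and a window that would run past either end of the sequence is shifted inward, a boundary rule the text does not specify. The name `centered_window` is hypothetical.

```python
def centered_window(candidate_position, size, sequence_length):
    """Return (start, end), 1-based inclusive, of a window of `size` bases
    with the candidate roughly at its center.

    Near either end of the sequence the window is shifted inward so that
    it still contains `size` bases whenever the sequence is long enough.
    """
    half = size // 2
    start = max(1, candidate_position - half)
    end = min(sequence_length, start + size - 1)
    start = max(1, end - size + 1)
    return start, end

# A 30,000-base window around a hypothetical candidate at base number 100,000.
start, end = centered_window(100_000, 30_000, 1_000_000)
```

Because the same `(start, end)` pair is applied to every healthy subject, the patient and control windows cover identical positions.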

The size of each partial genome base sequence, the number of candidate mutations it contains, the position of the candidate mutation within it, and the method of designating its range can be selected as appropriate according to the embodiment. For example, the position of the candidate mutation in each partial genome base sequence need not be the center of the sequence and may be, for example, an end of the sequence.

In FIG. 5, the control unit 11 designates one contiguous region as each partial genome base sequence. However, the method of designating the genomic region included in each partial genome base sequence need not be limited to this example. As illustrated in FIG. 6, the control unit 11 may designate a partial genome base sequence by joining a plurality of partial genomic regions that are separated from one another on the genome base sequence.

FIG. 6 shows an example of designating a partial genome base sequence by joining a plurality of partial genomic regions that are separated from one another on the genome base sequence. In FIG. 6, there are three partial genomic regions (31, 32, 33) separated on the genome base sequence, and region 32 contains a candidate mutation. In this case, for example, the control unit 11 can join the three regions (31, 32, 33) by omitting the bases between region 31 and region 32 and between region 32 and region 33, and thereby designate a single partial genome base sequence.

With this method of designating regions, regions located far apart on the genome base sequence can be designated as a single partial genome base sequence. This increases the freedom in designating the partial genome base sequence that serves as the range over which a candidate mutation is evaluated.
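One way to picture a partial genome base sequence built from separated regions is as a list of intervals. The sketch below is illustrative only: the interval representation and the function name are assumptions, and the coordinates are made up.

```python
def variants_in_regions(positions: set[int],
                        regions: list[tuple[int, int]]) -> set[int]:
    """Keep the mutation positions that fall in any of the joined regions.
    Bases between the regions are simply ignored (hypothetical helper)."""
    return {p for p in positions
            if any(start <= p <= end for start, end in regions)}


# Three separated regions standing in for regions 31, 32 and 33 of FIG. 6.
regions = [(100, 200), (500, 600), (900, 1000)]
sites = variants_in_regions({150, 300, 550, 950, 1200}, regions)
```

Positions 300 and 1200 lie outside every region, so they do not contribute to the degree of difference computed for this partial genome base sequence.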

Next, as illustrated in FIG. 5, the control unit 11 calculates, based on the number and positions of the mutations contained in each partial genome base sequence, a degree of difference indicating how the mutation sites differ between the patient and each healthy subject and a degree of difference indicating how the mutation sites differ between healthy subjects. Any index that can express how the mutation sites differ between two corresponding partial genome base sequences, such as a correlation coefficient, may be chosen as the degree of difference, as appropriate for the embodiment.

For example, the degree of difference between mutation sites may be defined as the ratio of the number of mutation sites found in only one of the two organisms to the total number of mutation sites in the union of the mutation sites that the two organisms have in the partial genome base sequence. The degree of difference defined in this way is described below with further reference to FIG. 7.

FIG. 7 illustrates the calculation of the degree of difference between the mutation sites contained in the partial genome base sequences of two organisms. In the situation illustrated in FIG. 7, the patient's partial genome base sequence contains five mutations, including the candidate mutation. The corresponding partial genome base sequence of the healthy subject also contains five mutations. The two partial genome base sequences share two mutations located at the same positions. Whether mutation sites in the two partial genome base sequences coincide can be determined by referring to the position information of the mutations.

In this case, the total number of mutation sites in the union of the mutation sites of the two organisms is eight, because two of the ten mutation sites counted across the two partial genome base sequences are shared. The number of mutation sites found in only one of the two organisms is six, because the two shared sites (four of the ten counted sites) are excluded. The degree of difference given by this definition is therefore 6/8 = 0.75.
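The count in this example can be written compactly. As a notational assumption (these symbols do not appear in the original text), let A and B denote the sets of mutation positions in the two partial genome base sequences, so that |A| = |B| = 5 and |A ∩ B| = 2:

```latex
d(A, B) = \frac{|A \,\triangle\, B|}{|A \cup B|}
        = \frac{|A| + |B| - 2\,|A \cap B|}{|A| + |B| - |A \cap B|}
        = \frac{5 + 5 - 2 \cdot 2}{5 + 5 - 2}
        = \frac{6}{8} = 0.75
```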

The ratio defined above becomes smaller as the two partial genome base sequences share more mutation sites at the same positions, because the number of mutation sites found in only one of the two organisms decreases. The ratio can therefore express how the mutation sites differ between two corresponding partial genome base sequences. In the following, the ratio (degree of difference) defined above is also referred to as the "Hamming distance ratio".

The number of mutation sites found in only one of the two organisms corresponds to the so-called Hamming distance between the two sets. The Hamming distance can be computed with, for example, an exclusive-OR operation followed by a count operation, and is therefore very simple to calculate. Defining the degree of difference in this way allows the difference between the mutation sites of two organisms to be derived by a very simple calculation.
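The exclusive-OR and count operations mentioned above can be sketched in two equivalent ways: with sets of positions, and with integer bitmasks. The function names and the sample positions are assumptions for illustration; the sample reproduces the FIG. 7 situation of five mutations each with two shared.

```python
def hamming_distance_ratio(a: set[int], b: set[int]) -> float:
    """Share of sites in the union that appear in only one of the two sets."""
    union = a | b
    if not union:
        raise ValueError("no mutation sites in either sequence")
    return len(a ^ b) / len(union)  # ^ on sets is the symmetric difference


def to_bitmask(positions: set[int], window_start: int) -> int:
    """Encode positions as bits of an integer, bit 0 being the window start."""
    mask = 0
    for pos in positions:
        mask |= 1 << (pos - window_start)
    return mask


def hamming_distance_ratio_bits(mask_a: int, mask_b: int) -> float:
    """Same ratio computed with XOR (differing sites) and OR (union) plus bit counts."""
    union_bits = bin(mask_a | mask_b).count("1")
    if union_bits == 0:
        raise ValueError("no mutation sites in either sequence")
    return bin(mask_a ^ mask_b).count("1") / union_bits


patient = {101, 205, 310, 420, 515}
healthy = {420, 515, 630, 745, 860}
ratio = hamming_distance_ratio(patient, healthy)
ratio_bits = hamming_distance_ratio_bits(to_bitmask(patient, 100),
                                         to_bitmask(healthy, 100))
```

The empty-union case is rejected explicitly here; how the apparatus treats a window with no mutations in either sequence is not stated in the text.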

The method of calculating the degree of difference between the mutation sites of two healthy subjects can be explained in the same way, by replacing the patient's partial genome base sequence with that of another healthy subject. The control unit 11 calculates, for each partial genome base sequence, the degree of difference between the mutation sites of the patient and each healthy subject, and stores the calculated values in a first group. The control unit 11 also calculates, for each partial genome base sequence, the degree of difference between the mutation sites of each pair of healthy subjects, and stores the calculated values in a second group. With one patient and 41 healthy subjects, 41 degrees of difference between the patient and the healthy subjects are calculated per partial genome base sequence, and 820 degrees of difference between pairs of healthy subjects (the number of ways of choosing 2 of 41) are calculated per partial genome base sequence.
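The two groups and their sizes can be illustrated with a short sketch. The helper names and the synthetic mutation positions are assumptions; only the group construction and the counts 41 and 820 come from the text.

```python
from itertools import combinations


def hamming_distance_ratio(a: set[int], b: set[int]) -> float:
    """Sites seen in only one set, divided by sites in the union."""
    return len(a ^ b) / len(a | b)


def build_groups(patient: set[int], controls: list[set[int]]):
    """First group: patient vs. each healthy subject.
    Second group: every unordered pair of healthy subjects."""
    group1 = [hamming_distance_ratio(patient, c) for c in controls]
    group2 = [hamming_distance_ratio(x, y) for x, y in combinations(controls, 2)]
    return group1, group2


# Synthetic data: 41 healthy subjects, each with one shared and one private site.
controls = [{1000, 2000 + i} for i in range(41)]
patient = {1000, 5000}
group1, group2 = build_groups(patient, controls)
print(len(group1), len(group2))
```

`itertools.combinations(controls, 2)` yields each pair exactly once, which is where 41 × 40 / 2 = 820 comes from.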

(Step S103)
In the next step S103, the control unit 11 functions as the candidate mutation evaluation unit 23 and tests, for each partial genome base sequence and according to a predetermined test method, whether there is a significant difference between the first group and the second group. The control unit 11 also calculates, for each partial genome base sequence, the statistic of the test as a candidate for the evaluation value used to evaluate the specificity of the candidate mutation contained in that partial genome base sequence. After calculating the evaluation value candidates, the control unit 11 advances the processing to the next step S104.

The predetermined test method may be any test that can determine whether there is a difference between two sets of data, and can be selected as appropriate for the embodiment. For example, the Kolmogorov-Smirnov test, the t-test, or the like can be adopted as the predetermined test method. When such a test is adopted, a t value and/or a p value is calculated in step S103 as the statistic of the test.

Here, the t value is a numerical value indicating the degree of difference between the two groups. The larger the absolute value of the t value, the more strongly the null hypothesis that there is no difference between the two groups can be rejected; that is, a larger absolute value indicates a significant difference between the two groups. Therefore, when the t value is used as the evaluation value for the specificity of a candidate mutation, the control unit 11 treats a larger absolute t value as a higher evaluation.
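As a rough illustration of a t value computed from the two groups, the sketch below uses Welch's unequal-variance form. The text only says "t-test", so the choice of Welch's variant, the function name, and the sample values are all assumptions. Obtaining the p value additionally requires the t distribution; a library routine such as SciPy's `scipy.stats.ttest_ind` could be used for that.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group1: list[float], group2: list[float]) -> float:
    """Welch's t statistic: difference of means over its standard error.
    Each group needs at least two values, and not all values may be identical."""
    n1, n2 = len(group1), len(group2)
    standard_error = sqrt(variance(group1) / n1 + variance(group2) / n2)
    return (mean(group1) - mean(group2)) / standard_error


# Made-up degrees of difference: patient-vs-healthy, then healthy-vs-healthy.
first_group = [0.9, 0.8, 1.0]
second_group = [0.5, 0.6, 0.4, 0.5]
t_value = welch_t(first_group, second_group)
```

A large positive value here would mean that the patient differs from the healthy subjects more than the healthy subjects differ from one another in this window.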

The p value is a numerical value indicating the probability that the difference between the two groups arose by chance. The closer the p value is to 0, the more strongly the null hypothesis that there is no difference between the two groups can be rejected; that is, a smaller p value indicates a significant difference between the two groups. Therefore, when the p value is used as the evaluation value for the specificity of a candidate mutation, the control unit 11 treats a p value closer to 0 as a higher evaluation.

In other words, the higher the evaluation indicated by the evaluation value, the more significant the difference between the first group and the second group. A higher evaluation therefore indicates that a patient-specific mutation has occurred in the corresponding partial genome base sequence.

Because the healthy subjects do not have the hereditary disease that the patient has, a mutation that occurs specifically in the patient is likely to be the responsible mutation for that hereditary disease. Accordingly, the higher the evaluation indicated by the evaluation value, the more likely it is that the candidate mutation contained in the corresponding partial genome base sequence is the responsible mutation for the hereditary disease.

(Step S104)
In the next step S104, the control unit 11 functions as the dissimilarity calculation unit 22, changes the size of each partial genome base sequence, and recalculates, for each partial genome base sequence, the degree of difference between the mutation sites of two individuals (the patient and a healthy subject, or one healthy subject and another). After recalculating the degrees of difference, the control unit 11 advances the processing to the next step S105. Apart from the change in the size of each partial genome base sequence, the processing of step S104 is the same as that of step S102.

For example, when the size of each partial genome base sequence is set to 30K bases in step S102 and is varied in increments of 10K (10,000) bases, the control unit 11 resets the size of each partial genome base sequence to 40K (40,000) bases in step S104. The control unit 11 then recalculates the degree of difference between the mutation sites of two individuals based on the partial genome base sequences reset to 40K bases.

In step S104, the size of each partial genome base sequence may be decreased rather than increased. The amount by which the size of each partial genome base sequence is changed can be set as appropriate for the embodiment.

(Step S105)
In the next step S105, the control unit 11 functions as the candidate mutation evaluation unit 23 and, using the recalculated degrees of difference, again tests for each partial genome base sequence whether there is a significant difference between the first group and the second group. The control unit 11 thereby recalculates the statistic obtained for each partial genome base sequence, that is, the candidate for the evaluation value of each candidate mutation. After recalculating the evaluation value candidates, the control unit 11 advances the processing to the next step S106. Except that it uses the degrees of difference recalculated in step S104, the processing of step S105 is the same as that of step S103, so its description is omitted.

(Step S106)
In the next step S106, the control unit 11 determines whether to repeat the processing of steps S104 and S105. If it determines that the processing is to be repeated, the control unit 11 returns the processing to step S104. If it determines that the processing is not to be repeated, the control unit 11 advances the processing to step S107.

For example, when the size of each partial genome base sequence is varied from 30K to 100K (100,000) bases in increments of 10K bases and an evaluation value is calculated for each candidate mutation at each size, the control unit 11 repeats the processing of steps S104 and S105 seven times. In this case, once the control unit 11 has calculated the evaluation values for the partial genome base sequences of 100K bases in step S105, it advances the processing to the next step S107. Otherwise, the control unit 11 returns the processing to step S104.
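The count of seven repetitions follows directly from the size schedule, as the small sketch below shows. The function name and defaults are assumptions chosen to match the example in the text.

```python
def window_sizes(first: int = 30_000, last: int = 100_000,
                 step: int = 10_000) -> list[int]:
    """All window sizes to evaluate, from `first` to `last` inclusive."""
    return list(range(first, last + 1, step))


sizes = window_sizes()
initial_size = sizes[0]      # handled once in steps S102 and S103
repeated_sizes = sizes[1:]   # each handled by one pass of steps S104 and S105
print(initial_size, len(repeated_sizes))
```

The first size is evaluated in steps S102 and S103, so only the remaining sizes (40K through 100K) account for repetitions of steps S104 and S105.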

The number of repetitions can be set as appropriate for the embodiment, and the number of times steps S104 and S105 are repeated may differ from one partial genome base sequence to another. Furthermore, when a trend appears such that the statistic repeatedly calculated in step S105 no longer varies by more than a predetermined value, the control unit 11 may determine that steps S104 and S105 are not to be repeated. The conditions for repeating steps S104 and S105 can be set as appropriate for the embodiment.

(Step S107)
In the next step S107, the control unit 11 selects, for each candidate mutation (partial genome base sequence), the statistic to be adopted as its evaluation value from among the statistics (evaluation value candidates) calculated in steps S103 and S105. After selecting evaluation values for all candidate mutations, the control unit 11 advances the processing to the next step S108.

The rule for adopting the evaluation value of each candidate mutation can be set as appropriate for the embodiment. For example, the control unit 11 may adopt, as the evaluation value of each candidate mutation, the statistic indicating the highest evaluation among the calculated statistics. In that case, when the t value is used as the evaluation value, the control unit 11 adopts the t value with the largest absolute value among the t values calculated in steps S103 and S105. When the p value is used as the evaluation value, the control unit 11 adopts the p value closest to 0 among the p values calculated in steps S103 and S105.
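The "highest evaluation" rule described above amounts to a one-line selection per statistic type. The sketch below is illustrative; the function name, the `kind` argument, and the sample values are assumptions.

```python
def select_evaluation(stats: list[float], kind: str) -> float:
    """Pick the statistic with the highest evaluation across window sizes.
    kind="p": smallest p value; kind="t": t value with the largest magnitude."""
    if kind == "p":
        return min(stats)
    if kind == "t":
        return max(stats, key=abs)
    raise ValueError(f"unknown statistic kind: {kind!r}")


best_p = select_evaluation([0.30, 0.02, 0.50], "p")
best_t = select_evaluation([1.2, -4.5, 3.0], "t")
```

Note that the t value is compared by absolute value, so a strongly negative t value is selected over a smaller positive one.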

(Step S108)
In the next step S108, the control unit 11 functions as the output control unit 24 and outputs the result of the test in a state in which the rank of each candidate mutation, based on the evaluation value selected in step S107, can be identified. The processing according to this operation example then ends.

As long as the rank of each candidate mutation based on its evaluation value can be identified, the form in which the test result is output can be set as appropriate for the embodiment. For example, the control unit 11 may output the test result in a state in which the candidate mutations can be sorted in ascending or descending order of evaluation value. Alternatively, the control unit 11 may output the test result with each candidate mutation simply associated with its evaluation value.
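One possible output form is a list sorted by p value with an explicit rank. The sketch below assumes p values as evaluation values; the function name and the mutation labels are hypothetical placeholders, and ties simply receive consecutive ranks.

```python
def rank_by_p_value(evaluations: dict[str, float]) -> list[tuple[int, str, float]]:
    """Return (rank, candidate, p value) rows, smallest p value first."""
    ordered = sorted(evaluations.items(), key=lambda item: item[1])
    return [(rank, name, p) for rank, (name, p) in enumerate(ordered, start=1)]


# Hypothetical candidate labels with made-up p values.
rows = rank_by_p_value({"variant_a": 0.40, "variant_b": 0.02, "variant_c": 0.15})
for rank, name, p in rows:
    print(rank, name, p)
```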

The method of outputting the test result can also be selected as appropriate for the embodiment. For example, the control unit 11 may output the test result by displaying it on the display device 13 (display). When the genome analysis apparatus 1 is connected to a printing device such as a printer via the external interface 15, the control unit 11 may control the printing device and output the test result by printing it on paper or the like.

(Operation and effects)
As described above, the genome analysis apparatus 1 according to the present embodiment acquires, for the patient and for each of a plurality of healthy subjects, a genome base sequence in which the mutation sites have been identified. Next, the genome analysis apparatus 1 identifies, from among the mutations in the patient's genome base sequence, a plurality of candidate mutations that are candidates for the responsible mutation, and designates a plurality of partial genome base sequences each containing at least one candidate mutation. The genome analysis apparatus 1 then calculates, for each designated partial genome base sequence, the degree of difference between the mutation sites of the patient and each healthy subject and the degree of difference between the mutation sites of each pair of healthy subjects. Finally, the genome analysis apparatus 1 tests, according to a predetermined test method, whether there is a significant difference between these two sets of degrees of difference, and adopts the statistic calculated in the test as the evaluation value of the candidate mutation contained in the corresponding partial genome base sequence.

The statistic calculated for each partial genome base sequence can depend on the range over which the difference in mutation sites between the patient and the healthy subjects is evaluated, that is, on the size of the partial genome base sequence. If the size of each partial genome base sequence were fixed, the difference between the mutation sites of the patient and the healthy subjects might not be evaluated accurately, and the evaluation value obtained for each candidate mutation might not correctly reflect the state of the partial genome base sequence. The ranking of the candidate mutations based on the evaluation values would then be inaccurate, and the accuracy of identifying the responsible mutation for the hereditary disease could decrease.

In contrast, the genome analysis apparatus 1 according to the present embodiment repeats the processing of steps S104 and S105, changing the size of each partial genome base sequence and recalculating the evaluation value of each candidate mutation. According to the present embodiment, it is therefore possible to evaluate, for each candidate mutation, whether the mutation occurs specifically in the patient while adjusting the number of bases contained in the partial genome base sequence. As a result, the responsible mutation that causes a hereditary disease can be identified accurately without relying on case comparison within an affected family or on comparison against known mutation databases.

(Other)
Although an embodiment of the present invention has been described in detail above, the foregoing description is in every respect merely an illustration of the present invention. Needless to say, various improvements and modifications can be made without departing from the scope of the present invention.

§4 Processing example

Examples of the present invention are described below. However, the present invention is not limited to the following examples.

The genome analysis apparatus 1 according to the above embodiment makes it possible to identify the responsible mutation for a hereditary disease by ranking a plurality of candidate mutations that are candidates for that responsible mutation. An example corresponding to the above embodiment was therefore prepared, and an experiment was conducted to check whether mutations already known to be responsible for a hereditary disease are correctly ranked near the top.

FIGS. 8A to 8C show the pedigrees of the subjects. In each figure, squares denote males and circles denote females. In FIGS. 8A and 8B, symbols filled in black denote individuals with hypertrophic cardiomyopathy, and unfilled symbols denote individuals without it. Similarly, in FIG. 8C, symbols filled in black denote individuals with restrictive cardiomyopathy, and unfilled symbols denote individuals without it.

In Family A shown in FIG. 8A, patient (II-2) is a 58-year-old male and patient (III-1) is a 32-year-old female. Both had hypertrophic cardiomyopathy, which can be inherited in an autosomal dominant manner. Patients (II-2) and (III-1) of Family A shared the mutation c.173G>A (p.R58Q) in the MYL2 gene (rs104894369), one of the causes of hypertrophic cardiomyopathy. Here, "rs104894369" is the accession number of the mutation in dbSNP, the SNP database provided by NCBI (National Center for Biotechnology Information). The prefix "c." indicates notation at the coding DNA level, and "c.173G>A" indicates that guanine (G) at position 173 of the MYL2 coding sequence is mutated to adenine (A). The prefix "p." indicates notation at the protein level, and "p.R58Q" indicates that arginine (R) at position 58 is replaced by glutamine (Q). The same notation is used below.

Patient (III-4) of Family B shown in FIG. 8B is a 39-year-old male who had hypertrophic cardiomyopathy, which can be inherited in an autosomal dominant manner. Patient (III-4) of Family B carried the mutation c.746G>A (p.R249Q) in the MYH7 gene (rs3218713), one of the causes of hypertrophic cardiomyopathy. Here, "rs3218713" is the dbSNP accession number. "c.746G>A" indicates that guanine (G) at position 746 of the MYH7 coding sequence is mutated to adenine (A), and "p.R249Q" indicates that arginine (R) at position 249 is replaced by glutamine (Q).

As a negative control, the evaluation of this example was also applied to a patient who had been diagnosed with restrictive cardiomyopathy and whose cardiomyopathy was judged not to be hereditary because the mutation was not found in the patient's parents. FIG. 8C shows Family C, the subject of that experiment. Patient (III-2) of Family C shown in FIG. 8C was a 7-year-old female who had restrictive cardiomyopathy, which can arise from a de novo mutation in a troponin gene or the like. Patient (III-2) of Family C carried the mutation c.584T>C (p.I195T) in the TNNI3 gene, which has previously been reported as one of the causes of restrictive cardiomyopathy. Here, "c.584T>C" indicates that thymine (T) at position 584 of the TNNI3 coding sequence is mutated to cytosine (C), and "p.I195T" indicates that isoleucine (I) at position 195 is replaced by threonine (T).

In the genome analysis apparatus according to the example, the genome base sequence of each patient was obtained by sequencing with the Illumina HiSeq 2000. The mutation sites in each patient's genome base sequence were identified by comparing the obtained sequence with hg19. In addition, the genome base sequences of 41 healthy subjects unrelated to the patients were obtained by the same method, and the mutation sites in each healthy subject's genome base sequence were identified by comparing the obtained sequence with hg19.

Next, in this example, the rare mutations identified by the above method among the mutations in each patient's genome base sequence were set as candidate mutations, and each partial genome base sequence was designated so that its candidate mutation lay at the center of the sequence. The Hamming distance ratio defined above was adopted as the degree of difference between the mutation sites of two organisms in each partial genome base sequence, the t-test was adopted as the predetermined test method, and the p value calculated by the t-test was adopted as the evaluation value of each candidate mutation. The size of each partial genome base sequence was varied from 30K to 100K bases in increments of 10K bases, a p value was calculated at each size, and the smallest of the calculated p values was adopted as the evaluation value of the candidate mutation contained in the corresponding partial genome base sequence. FIG. 9 shows the result of ranking the candidate mutations of each patient according to this example.

FIG. 9 shows the result of ranking the candidate mutations of each patient by the processing of this example. When the Hamming distance rate is calculated for a partial genome base sequence that contains only a few single-nucleotide mutations, its value can become unreasonably large and lead to errors. Therefore, in this example, when a partial genome base sequence did not contain three or more single-nucleotide mutations, the evaluation of the candidate mutation contained in that partial genome base sequence was omitted.
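The distance calculation and the skip rule can be sketched as below. The Hamming distance rate follows the definition given in the claims (mutation sites found in only one of two organisms, divided by the sites in the union of both). The function names are hypothetical, and counting the single-nucleotide mutations over all individuals in the window is an assumption, since the text does not state over which individuals the count is taken.

```python
from itertools import combinations
from typing import List, Optional, Sequence, Set, Tuple

MIN_SINGLE_BASE_MUTATIONS = 3  # threshold stated in the example


def hamming_distance_rate(a: Set[int], b: Set[int]) -> Optional[float]:
    """Sites present in only one organism divided by sites present in either."""
    union = a | b
    if not union:
        return None  # undefined when neither organism has a mutation here
    return len(a ^ b) / len(union)


def distance_groups(
    target: Set[int],
    controls: Sequence[Set[int]],
    start: int,
    end: int,
) -> Optional[Tuple[List[float], List[float]]]:
    """Return (target-vs-control, control-vs-control) distances for one window.

    Returns None when the window [start, end) holds fewer than three mutation
    sites (assumption: counted over the target and all controls together).
    """
    def clip(sites: Set[int]) -> Set[int]:
        return {p for p in sites if start <= p < end}

    t = clip(target)
    cs = [clip(c) for c in controls]
    if len(t.union(*cs)) < MIN_SINGLE_BASE_MUTATIONS:
        return None

    target_vs_control = []
    for c in cs:
        d = hamming_distance_rate(t, c)
        if d is not None:
            target_vs_control.append(d)

    control_vs_control = []
    for x, y in combinations(cs, 2):
        d = hamming_distance_rate(x, y)
        if d is not None:
            control_vs_control.append(d)

    return target_vs_control, control_vs_control
```

The two returned lists are the samples that the test compares. With a single differing site and nothing else in the window, the rate is 1.0, which illustrates why sparsely populated windows are skipped.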

As a result, in the processing for the family A patient (II-2), evaluation values were calculated for 542 of the 614 mutations evaluated as candidate mutations. In the processing for the family A patient (III-1), evaluation values were calculated for 570 of the 646 mutations evaluated as candidate mutations. In the processing for the family B patient (III-4), evaluation values were calculated for 534 of the 606 mutations evaluated as candidate mutations. In the processing for the family C patient (III-2), evaluation values were calculated for 547 of the 631 mutations evaluated as candidate mutations.

According to the ranking results shown in FIG. 9, the evaluation value (p-value) of the c.173G>A (p.R58Q) mutation in the MYL2 gene (rs104894369) found in the family A patient (II-2) was 0.02, and the mutation was ranked 12th among all rare mutations for which evaluation values were calculated, which corresponds to the top 2.2%. The evaluation value (p-value) of the c.173G>A (p.R58Q) mutation in the MYL2 gene (rs104894369) found in the family A patient (III-1) was 0.02, and the mutation was ranked 16th among all rare mutations for which evaluation values were calculated, which corresponds to the top 2.8%. Furthermore, the evaluation value (p-value) of the c.746G>A (p.R249Q) mutation in the MYH7 gene (rs3218713) found in the family B patient (III-4) was 0.02, and the mutation was ranked 9th among all rare mutations for which evaluation values were calculated, which corresponds to the top 1.7%. That is, for hypertrophic cardiomyopathy, which is an example of a hereditary disease, each causal mutation was ranked within the top 3% of the rare mutations.

On the other hand, the evaluation value (p-value) of the c.584T>C (p.l195T) mutation in the TNNI3 gene found in the family C patient (III-2) was 0.34, and the mutation was ranked 232nd among all rare mutations for which evaluation values were calculated, which corresponds to the top 42.4%. That is, the mutation causing restrictive cardiomyopathy, which is presumed not to be hereditary in this patient because the mutation was not found in the parents, was not ranked near the top of the rare mutations. The experimental results of this example therefore show that, among rare mutations, the mutations responsible for a hereditary disease can be ranked near the top with high accuracy.
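The percentages quoted in these results follow from dividing each rank by the number of mutations for which an evaluation value was calculated for that patient. The short check below reproduces them from the counts reported in this example; the variable and function names are hypothetical.

```python
# (patient, rank of the causal mutation, number of mutations with an evaluation value)
results = [
    ("Family A (II-2)", 12, 542),
    ("Family A (III-1)", 16, 570),
    ("Family B (III-4)", 9, 534),
    ("Family C (III-2)", 232, 547),
]


def top_percent(rank: int, evaluated: int) -> float:
    """Position of a rank within the evaluated mutations, as a percentage."""
    return round(100 * rank / evaluated, 1)


percentages = {name: top_percent(rank, n) for name, rank, n in results}
for name, pct in percentages.items():
    print(f"{name}: top {pct}%")
```

The three hypertrophic cardiomyopathy cases fall at 2.2%, 2.8%, and 1.7%, all within the top 3%, while the family C case falls at 42.4%.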

1 ... Genome analysis apparatus,
5 ... Program, 6 ... Storage medium,
11 ... Control unit, 12 ... Storage unit, 13 ... Display device, 14 ... Input device,
15 ... External interface, 16 ... Communication interface, 17 ... Drive,
21 ... Genome base sequence acquisition unit, 22 ... Difference degree calculation unit,
23 ... Candidate mutation evaluation unit, 24 ... Output control unit

Claims

A genome analysis apparatus comprising:
a genome base sequence acquisition unit that acquires a genome base sequence of a target organism having a specific hereditary phenotype and genome base sequences of a plurality of control organisms that are of the same species as the target organism and do not have the hereditary phenotype, each being a genome base sequence in which mutation sites have been identified by comparison with a reference genome base sequence serving as a reference for the genome base sequence of the species;
a difference degree calculation unit that designates, from among the mutations identified by comparison with the reference genome base sequence, a plurality of candidate mutations that are candidates for the responsible mutation of the hereditary phenotype of the target organism, designates a plurality of partial genome base sequences each including at least one of the candidate mutations, and calculates, for each partial genome base sequence and on the basis of the number and position information of the mutations included in that partial genome base sequence, a degree of difference in mutation sites between the target organism and each control organism and a degree of difference in mutation sites among the plurality of control organisms;
a candidate mutation evaluation unit that tests, for each partial genome base sequence and on the basis of a predetermined test method, whether there is a significant difference between the degree of difference in mutation sites between the target organism and each control organism and the degree of difference in mutation sites among the plurality of control organisms, and thereby calculates, for each partial genome base sequence, a statistic based on the test as an evaluation value for evaluating the specificity of the candidate mutation included in that partial genome base sequence; and
an output control unit that outputs the result of the test in a state in which the rank of each candidate mutation based on the evaluation value can be identified,
wherein
the difference degree calculation unit changes the size of each partial genome base sequence and recalculates, for each partial genome base sequence, the degree of difference in mutation sites between the target organism and each control organism and the degree of difference in mutation sites among the plurality of control organisms, and
the candidate mutation evaluation unit performs the test again using the recalculated degrees of difference, recalculates the statistic based on the test for each partial genome base sequence, and selects, for each partial genome base sequence, the statistic to be used as the evaluation value from among the statistics calculated for the respective sizes.

The genome analysis apparatus according to claim 1,
wherein the difference degree calculation unit calculates the degree of difference in mutation sites as the ratio of the number of mutation sites found in only one of two organisms to the number of mutations included in the union of the mutation sites of the two organisms within each partial genome base sequence.

The genome analysis apparatus according to claim 1 or 2,
wherein the genome base sequence acquisition unit acquires the genome base sequences of the plurality of control organisms, which include organisms unrelated to the target organism.

The genome analysis apparatus according to any one of claims 1 to 3,
wherein the difference degree calculation unit designates the partial genome base sequence by combining a plurality of partial regions that are separated from one another on the genome base sequence.

The genome analysis apparatus according to any one of claims 1 to 4,
wherein the genome base sequence acquisition unit acquires, as the genome base sequences of the target organism and the control organisms, genome base sequences obtained by removing at least part of the intron regions from the whole genome region.

The genome analysis apparatus according to any one of claims 1 to 5,
wherein the target organism and the plurality of control organisms are humans,
the specific hereditary phenotype is a hereditary disease, and
the candidate mutation is a mutation that is detected in the genome base sequence of the target organism and that is detected only at or below a certain frequency in a population of organisms of the same species.

A genome analysis method in which a computer executes:
a genome base sequence acquisition step of acquiring a genome base sequence of a target organism having a specific hereditary phenotype and genome base sequences of a plurality of control organisms that are of the same species as the target organism and do not have the hereditary phenotype, each being a genome base sequence in which mutation sites have been identified by comparison with a reference genome base sequence serving as a reference for the genome base sequence of the species;
a difference degree calculation step of designating, from among the mutations identified by comparison with the reference genome base sequence, a plurality of candidate mutations that are candidates for the responsible mutation of the hereditary phenotype of the target organism, designating a plurality of partial genome base sequences each including at least one of the candidate mutations, and calculating, for each partial genome base sequence and on the basis of the number and position information of the mutations included in that partial genome base sequence, a degree of difference in mutation sites between the target organism and each control organism and a degree of difference in mutation sites among the plurality of control organisms;
a candidate mutation evaluation step of testing, for each partial genome base sequence and on the basis of a predetermined test method, whether there is a significant difference between the degree of difference in mutation sites between the target organism and each control organism and the degree of difference in mutation sites among the plurality of control organisms, and thereby calculating, for each partial genome base sequence, a statistic based on the test as an evaluation value for evaluating the specificity of the candidate mutation included in that partial genome base sequence; and
an output step of outputting the result of the test in a state in which the rank of each candidate mutation based on the evaluation value can be identified,
wherein
in the difference degree calculation step, the size of each partial genome base sequence is changed, and the degree of difference in mutation sites between the target organism and each control organism and the degree of difference in mutation sites among the plurality of control organisms are recalculated for each partial genome base sequence, and
in the candidate mutation evaluation step, the test is performed again using the recalculated degrees of difference, the statistic based on the test is recalculated for each partial genome base sequence, and the statistic to be used as the evaluation value is selected, for each partial genome base sequence, from among the statistics calculated for the respective sizes.

A genome analysis program that causes a computer to execute:
a genome base sequence acquisition step of acquiring a genome base sequence of a target organism having a specific hereditary phenotype and genome base sequences of a plurality of control organisms that are of the same species as the target organism and do not have the hereditary phenotype, each being a genome base sequence in which mutation sites have been identified by comparison with a reference genome base sequence serving as a reference for the genome base sequence of the species;
a difference degree calculation step of designating, from among the mutations identified by comparison with the reference genome base sequence, a plurality of candidate mutations that are candidates for the responsible mutation of the hereditary phenotype of the target organism, designating a plurality of partial genome base sequences each including at least one of the candidate mutations, and calculating, for each partial genome base sequence and on the basis of the number and position information of the mutations included in that partial genome base sequence, a degree of difference in mutation sites between the target organism and each control organism and a degree of difference in mutation sites among the plurality of control organisms;
a candidate mutation evaluation step of testing, for each partial genome base sequence and on the basis of a predetermined test method, whether there is a significant difference between the degree of difference in mutation sites between the target organism and each control organism and the degree of difference in mutation sites among the plurality of control organisms, and thereby calculating, for each partial genome base sequence, a statistic based on the test as an evaluation value for evaluating the specificity of the candidate mutation included in that partial genome base sequence; and
an output step of outputting the result of the test in a state in which the rank of each candidate mutation based on the evaluation value can be identified,
wherein
in the difference degree calculation step, the program causes the computer to change the size of each partial genome base sequence and to recalculate, for each partial genome base sequence, the degree of difference in mutation sites between the target organism and each control organism and the degree of difference in mutation sites among the plurality of control organisms, and
in the candidate mutation evaluation step, the program causes the computer to perform the test again using the recalculated degrees of difference, to recalculate the statistic based on the test for each partial genome base sequence, and to select, for each partial genome base sequence, the statistic to be used as the evaluation value from among the statistics calculated for the respective sizes.