JP2012078880A

JP2012078880A - Genome sequence specification device, genome sequence specification program and genome sequence specification method of genome sequence specification device

Info

Publication number: JP2012078880A
Application number: JP2010220392A
Authority: JP
Inventors: Hajime Oyanagi; 一大柳; Nobukazu Namiki; 信和並木
Original assignee: Mitsubishi Space Software Co Ltd
Current assignee: Mitsubishi Space Software Co Ltd
Priority date: 2010-09-30
Filing date: 2010-09-30
Publication date: 2012-04-19

Abstract

PROBLEM TO BE SOLVED: To make it possible to restore the base sequence of a whole genome from a large number of decoded fragment sequences. SOLUTION: A reference mapping part 110 performs reference mapping of fragment sequence data 191 using reference sequence data 192, generates genome provisional sequence data 101, and specifies the remaining fragment sequence data 191 as leftover sequence data 102. A gap neighbor sequence extraction part 120 extracts, from the genome provisional sequence data 101, the fragment sequence data 191 placed before and after each part that the reference mapping could not specify (a gap) as gap neighbor sequence data 103. A de novo assembly part 130 performs de novo assembly of the gap neighbor sequence data 103 and the leftover sequence data 102, and generates assembly part sequence data 104. A complete genome restore part 140 sets the assembly part sequence data 104 in the genome provisional sequence data 101 and generates genome sequence data 105.

Description

The present invention relates to a genome sequence identification device that identifies the base sequence of a genome, a genome sequence identification program, and a genome sequence identification method of the genome sequence identification device.

The genomes of higher organisms are base sequences of roughly hundreds of millions to billions of bases, but current technology can decode only a base sequence of about 1000 bases at a time.
Therefore, to decode the base sequence of an entire genome, the genome must be fragmented into short base sequences of about 30 to 1000 bases, the fragments must be decoded, and the base sequence of the entire genome must be restored from the large number of decoded fragment sequences.

Two methods are used to restore a genome: “reference mapping”, which attempts the restoration using the genome of a related species as a hint, and “de novo assembly”, which attempts the restoration without a hint.

However, reference mapping cannot restore a portion that does not correspond to the base sequence of the related species genome.
De novo assembly also leaves portions that cannot be restored, and it cannot specify which part of the genome a restored portion belongs to. Furthermore, because the amount of calculation is enormous, a computer with high processing capability must be prepared.

JP 2009-116559 A
JP 2006-039867 A
JP 07-115959 A

An object of the present invention is to make it possible to restore the base sequence of an entire genome from a large number of decoded fragment sequences.

The genome sequence identification device of the present invention comprises:
a reference mapping unit that inputs a plurality of fragment sequence data indicating fragments of the base sequence of a target genome, inputs reference sequence data indicating the base sequence of a known genome whose base sequence has been specified, compares the plurality of fragment sequence data with the reference sequence data, and, based on the comparison result, generates, as mapping partial sequence data, data in which a plurality of fragment sequence data are combined in correspondence with the reference sequence data;
a non-mapping fragment data extraction unit that extracts, from the plurality of fragment sequence data, a plurality of fragment sequence data not included in the mapping partial sequence data generated by the reference mapping unit as a plurality of non-mapping fragment data;
an end sequence data extraction unit that extracts, from the mapping partial sequence data generated by the reference mapping unit, fragment sequence data included in an end portion of the mapping partial sequence data as end sequence data;
a de novo assembly unit that compares the end sequence data extracted by the end sequence data extraction unit with the plurality of non-mapping fragment data extracted by the non-mapping fragment data extraction unit and, based on the comparison result, generates, as assembled partial sequence data, data in which the end sequence data and at least one of the non-mapping fragment data are combined at their matching portion; and
a genome sequence data generation unit that generates, as genome sequence data indicating the base sequence of the target genome, data in which the mapping partial sequence data generated by the reference mapping unit and the assembled partial sequence data generated by the de novo assembly unit are combined at the portion indicating the end sequence data.

The reference mapping unit generates a plurality of mapping partial sequence data,
the end sequence data extraction unit extracts a plurality of end sequence data from the plurality of mapping partial sequence data,
the de novo assembly unit compares the plurality of end sequence data with the plurality of non-mapping fragment data and generates a plurality of assembled partial sequence data based on the comparison result, and
the genome sequence data generation unit combines the plurality of mapping partial sequence data and the plurality of assembled partial sequence data to generate the genome sequence data.

The end sequence data extraction unit identifies, as a gap, each portion of the reference sequence data that does not correspond to any mapping partial sequence data and, for each identified gap, extracts the end sequence data at the gap-side end of the mapping partial sequence data before and after the gap,
the de novo assembly unit compares the plurality of end sequence data with the plurality of non-mapping fragment data and, based on the comparison result, generates a plurality of assembled partial sequence data corresponding to the plurality of gaps, and
the genome sequence data generation unit combines the plurality of mapping partial sequence data and the plurality of assembled partial sequence data to generate the genome sequence data.

The de novo assembly unit compares, for each gap, the end sequence data before and after the gap with the plurality of non-mapping fragment data and generates assembled partial sequence data for each gap based on the comparison result, and
the genome sequence data generation unit generates the genome sequence data by combining, for each gap, the mapping partial sequence data before and after the gap with the assembled partial sequence data corresponding to the gap.

The genome sequence identification program of the present invention causes a computer to execute:
a reference mapping process that inputs a plurality of fragment sequence data indicating fragments of the base sequence of a target genome, inputs reference sequence data indicating the base sequence of a known genome whose base sequence has been specified, compares the plurality of fragment sequence data with the reference sequence data, and, based on the comparison result, generates, as mapping partial sequence data, data in which a plurality of fragment sequence data are combined in correspondence with the reference sequence data;
a non-mapping fragment data extraction process that extracts, from the plurality of fragment sequence data, a plurality of fragment sequence data not included in the mapping partial sequence data generated by the reference mapping process as a plurality of non-mapping fragment data;
an end sequence data extraction process that extracts, from the mapping partial sequence data generated by the reference mapping process, fragment sequence data included in an end portion of the mapping partial sequence data as end sequence data;
a de novo assembly process that compares the end sequence data extracted by the end sequence data extraction process with the plurality of non-mapping fragment data extracted by the non-mapping fragment data extraction process and, based on the comparison result, generates, as assembled partial sequence data, data in which the end sequence data and at least one of the non-mapping fragment data are combined at their matching portion; and
a genome sequence data generation process that generates, as genome sequence data indicating the base sequence of the target genome, data in which the mapping partial sequence data generated by the reference mapping process and the assembled partial sequence data generated by the de novo assembly process are combined at the portion indicating the end sequence data.

The reference mapping process generates a plurality of mapping partial sequence data,
the end sequence data extraction process extracts a plurality of end sequence data from the plurality of mapping partial sequence data,
the de novo assembly process compares the plurality of end sequence data with the plurality of non-mapping fragment data and generates a plurality of assembled partial sequence data based on the comparison result, and
the genome sequence data generation process combines the plurality of mapping partial sequence data and the plurality of assembled partial sequence data to generate the genome sequence data.

The end sequence data extraction process identifies, as a gap, each portion of the reference sequence data that does not correspond to any mapping partial sequence data and, for each identified gap, extracts the end sequence data at the gap-side end of the mapping partial sequence data before and after the gap,
the de novo assembly process compares the plurality of end sequence data with the plurality of non-mapping fragment data and, based on the comparison result, generates a plurality of assembled partial sequence data corresponding to the plurality of gaps, and
the genome sequence data generation process combines the plurality of mapping partial sequence data and the plurality of assembled partial sequence data to generate the genome sequence data.

The de novo assembly process compares, for each gap, the end sequence data before and after the gap with the plurality of non-mapping fragment data and generates assembled partial sequence data for each gap based on the comparison result, and
the genome sequence data generation process generates the genome sequence data by combining, for each gap, the mapping partial sequence data before and after the gap with the assembled partial sequence data corresponding to the gap.

In the genome sequence identification method of the present invention:
a reference mapping unit inputs a plurality of fragment sequence data indicating fragments of the base sequence of a target genome, inputs reference sequence data indicating the base sequence of a known genome whose base sequence has been specified, compares the plurality of fragment sequence data with the reference sequence data, and, based on the comparison result, generates, as mapping partial sequence data, data in which a plurality of fragment sequence data are combined in correspondence with the reference sequence data;
a non-mapping fragment data extraction unit extracts, from the plurality of fragment sequence data, a plurality of fragment sequence data not included in the mapping partial sequence data generated by the reference mapping unit as a plurality of non-mapping fragment data;
an end sequence data extraction unit extracts, from the mapping partial sequence data generated by the reference mapping unit, fragment sequence data included in an end portion of the mapping partial sequence data as end sequence data;
a de novo assembly unit compares the end sequence data extracted by the end sequence data extraction unit with the plurality of non-mapping fragment data extracted by the non-mapping fragment data extraction unit and, based on the comparison result, generates, as assembled partial sequence data, data in which the end sequence data and at least one of the non-mapping fragment data are combined at their matching portion; and
a genome sequence data generation unit generates, as genome sequence data indicating the base sequence of the target genome, data in which the mapping partial sequence data generated by the reference mapping unit and the assembled partial sequence data generated by the de novo assembly unit are combined at the portion indicating the end sequence data.

According to the present invention, for example, the base sequence of an entire genome can be restored from a large number of decoded fragment sequences.

A functional configuration diagram of the genome restoration device 100 according to Embodiment 1.
A schematic diagram of reference mapping.
A schematic diagram of de novo assembly.
A flowchart showing the genome restoration method of the genome restoration device 100 according to Embodiment 1.
A process outline diagram showing an outline of S110 to S130 of the genome restoration method according to Embodiment 1.
A first process outline diagram showing an outline of S140 of the genome restoration method according to Embodiment 1.
A second process outline diagram showing an outline of S140 of the genome restoration method according to Embodiment 1.
A process outline diagram showing an outline of S150 of the genome restoration method according to Embodiment 1.
A diagram showing an example of the hardware resources of the genome restoration device 100 according to Embodiment 1.

Embodiment 1.
An embodiment of a genome restoration device, method, and program that restore an entire genome from a large number of fragmented base sequences will be described.

Here, “genome” means a chromosome, DNA (deoxyribonucleic acid), a gene, or the like.

FIG. 1 is a functional configuration diagram of the genome restoration device 100 according to Embodiment 1.
The functional configuration of the genome restoration device 100 according to Embodiment 1 will be described with reference to FIG. 1.

The genome restoration device 100 (an example of a genome sequence identification device) includes a reference mapping unit 110, a gap neighborhood sequence extraction unit 120, a de novo assembly unit 130, a complete genome restoration unit 140, and a sequence data storage unit 190.

Hereinafter, the genome to be restored is referred to as the “target genome”.
The genome of an organism closely related to the target organism, that is, the organism having the target genome, is referred to as the “related species genome”.

The sequence data storage unit 190 stores a large number of fragment sequence data 191 and reference sequence data 192.
The fragment sequence data 191 is data indicating a fragment of the base sequence of the target genome.
The reference sequence data 192 is data indicating the entire base sequence of the related species genome (a known genome).

The reference mapping unit 110 inputs a plurality of fragment sequence data 191 and the reference sequence data 192 from the sequence data storage unit 190.
The reference mapping unit 110 compares the plurality of fragment sequence data 191 with the reference sequence data 192.
Based on the comparison result, the reference mapping unit 110 generates, as mapping partial sequence data 101A, data in which a plurality of fragment sequence data 191 are combined in correspondence with the reference sequence data 192.

Further, the reference mapping unit 110 (an example of a non-mapping fragment data extraction unit) extracts a plurality of leftover sequence data 102 (non-mapping fragment data) from the plurality of fragment sequence data 191.
The leftover sequence data 102 are the fragment sequence data 191 that are not included in the mapping partial sequence data 101A.

The gap neighborhood sequence extraction unit 120 (an example of an end sequence data extraction unit) extracts gap neighborhood sequence data 103 (end sequence data) from the mapping partial sequence data 101A generated by the reference mapping unit 110.
The gap neighborhood sequence data 103 is fragment sequence data 191 included in an end portion of the mapping partial sequence data 101A.

Specifically, the gap neighborhood sequence extraction unit 120 extracts a plurality of gap neighborhood sequence data 103 from a plurality of mapping partial sequence data 101A.
For example, the gap neighborhood sequence extraction unit 120 identifies, as a gap, each portion of the reference sequence data 192 that does not correspond to any mapping partial sequence data 101A. For each identified gap, the gap neighborhood sequence extraction unit 120 extracts the fragment sequence data at the gap-side end of the mapping partial sequence data 101A before and after the gap as the gap neighborhood sequence data 103.
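The gap identification described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation of the gap neighborhood sequence extraction unit 120: the mapping result is assumed to be available as a list of hypothetical `(start, fragment)` placements in reference coordinates, and only the fragments that directly border a gap are treated as gap neighborhood sequences. The function names `find_gaps` and `gap_neighbors` are invented for this example.

```python
def find_gaps(placements, reference_length):
    """Return half-open (start, end) intervals of the reference
    that are not covered by any placed fragment."""
    gaps, covered_to = [], 0
    for start, fragment in sorted(placements):
        if start > covered_to:
            gaps.append((covered_to, start))
        covered_to = max(covered_to, start + len(fragment))
    if covered_to < reference_length:
        gaps.append((covered_to, reference_length))
    return gaps


def gap_neighbors(placements, gap):
    """Return (before, after): fragments ending exactly at the gap start
    and fragments starting exactly at the gap end (an assumed criterion)."""
    gap_start, gap_end = gap
    before = [f for s, f in placements if s + len(f) == gap_start]
    after = [f for s, f in placements if s == gap_end]
    return before, after
```

With three fragments placed at positions 0, 3, and 10 of a 14-base reference, `find_gaps` reports the single uncovered position between the second and third fragment, and `gap_neighbors` returns the fragments on either side of it.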

The de novo assembly unit 130 compares the gap neighborhood sequence data 103 extracted by the gap neighborhood sequence extraction unit 120 with the plurality of leftover sequence data 102 extracted by the reference mapping unit 110.
The de novo assembly unit 130 generates assembled partial sequence data 104 based on the comparison result.
The assembled partial sequence data 104 is data in which the gap neighborhood sequence data 103 and at least one of the leftover sequence data 102 are combined at their matching portion.

Specifically, the de novo assembly unit 130 compares the plurality of gap neighborhood sequence data 103 with the plurality of leftover sequence data 102 and generates a plurality of assembled partial sequence data 104 based on the comparison result.
For example, the de novo assembly unit 130 compares the plurality of gap neighborhood sequence data 103 with the plurality of leftover sequence data 102 and, based on the comparison result, generates a plurality of assembled partial sequence data 104 corresponding to the plurality of gaps.
In addition, the de novo assembly unit 130 compares, for each gap, the gap neighborhood sequence data 103 before and after the gap with the plurality of leftover sequence data 102 and generates the assembled partial sequence data 104 for each gap based on the comparison result.
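As a rough illustration of combining a gap neighborhood sequence with leftover sequences at matching portions, the following Python sketch extends a seed sequence to the right with whichever leftover fragment overlaps its end most. The exact-match criterion, the greedy selection, the `min_len` parameter, and the function names are assumptions made for this example; the source does not specify the matching algorithm.

```python
def overlap(left, right, min_len):
    """Length of the longest suffix of `left` that equals a prefix of
    `right`; 0 if no overlap of at least `min_len` bases exists."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def extend_from_seed(seed, leftovers, min_len=3):
    """Greedily extend `seed` (a gap neighborhood sequence) to the right
    using leftover fragments whose start matches the current end."""
    contig, pool = seed, list(leftovers)
    while True:
        best = max(pool, key=lambda f: overlap(contig, f, min_len), default=None)
        if best is None or overlap(contig, best, min_len) == 0:
            return contig
        # Append only the part of the fragment beyond the matching portion.
        contig += best[overlap(contig, best, min_len):]
        pool.remove(best)
```

Starting from the seed, each round consumes one leftover fragment, so the loop ends once no remaining fragment overlaps the current end.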

The complete genome restoration unit 140 (an example of a genome sequence data generation unit) generates genome sequence data 105 using the mapping partial sequence data 101A generated by the reference mapping unit 110 and the assembled partial sequence data 104 generated by the de novo assembly unit 130.
The genome sequence data 105 is data in which the mapping partial sequence data 101A and the assembled partial sequence data 104 are combined at the portion indicating the gap neighborhood sequence data 103. The genome sequence data 105 indicates the base sequence of the target genome.

Specifically, the complete genome restoration unit 140 combines the plurality of mapping partial sequence data 101A and the plurality of assembled partial sequence data 104 to generate the genome sequence data 105.
For example, the complete genome restoration unit 140 generates the genome sequence data 105 by combining, for each gap, the mapping partial sequence data 101A before and after the gap with the assembled partial sequence data 104 corresponding to the gap.
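The per-gap combination can be sketched as two overlap joins: the mapped sequence before the gap is joined to the assembled sequence, and the result is joined to the mapped sequence after the gap. This is a hedged illustration only; it assumes the assembled sequence begins and ends with the gap neighborhood sequences so that exact overlaps exist, and `join_at_overlap`, `fill_gap`, and `min_len` are hypothetical names.

```python
def join_at_overlap(left, right, min_len=3):
    """Join two sequences at the longest suffix/prefix match of at least
    `min_len` bases; return None when they share no such match."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


def fill_gap(before, assembled, after, min_len=3):
    """Combine the mapped sequence before a gap, the assembled sequence
    for the gap, and the mapped sequence after the gap."""
    joined = join_at_overlap(before, assembled, min_len)
    if joined is None:
        return None
    return join_at_overlap(joined, after, min_len)
```

Returning `None` when no overlap is found leaves it to the caller to decide whether the gap stays open.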

The genome restoration method of the genome restoration device 100 is described below.

The genome restoration device 100 uses reference mapping and de novo assembly to generate the base sequence data of the target genome (genome sequence data 105) from a large number of fragment sequence data 191.

The fragment sequence data 191 is text data that represents a fragment of the base sequence of a genome with the characters “A (adenine)”, “T (thymine)”, “G (guanine)”, and “C (cytosine)”.

The fragment sequence data 191 is generated by a base sequence decoding device called a sequencer.
A sequencer separates genome fragments by electrophoresis, decodes their base sequences, and outputs the decoded result as data. The length of base sequence that a sequencer can decode is about 1000 bases.

FIG. 2 is a schematic diagram of reference mapping.
An overview of reference mapping will be described with reference to FIG. 2.

Reference mapping is a method of generating the base sequence data of the target genome (genome sequence data) by mapping a large number of fragment sequence data, obtained from multiple copies of the target genome, onto the base sequence data of the related species genome (reference sequence data).
Reference mapping is executed by the following procedure.

Procedure 1: Each fragment sequence data is compared with the reference sequence data, and for each fragment sequence data the portion of the reference sequence data that matches (is homologous to) it is identified. The matching condition includes conditions other than an exact match (for example, a match [similarity] at or above a predetermined ratio).
In FIG. 2, the fragment sequence data (a) matches the 1st to 1000th characters of the reference sequence data, and the fragment sequence data (b) matches the 301st to 1301st characters of the reference sequence data.
Fragment sequence data that does not match any part of the reference sequence data (leftover sequence data) is not used in the subsequent Procedure 2.

Procedure 2: Each fragment sequence data is set at the same data position as its matching portion in the reference sequence data to generate the genome sequence data.
For example, the fragment sequence data (a) is set at the 1st to 1000th characters of the genome sequence data, and the fragment sequence data (b) is set at the 301st to 1301st characters of the genome sequence data. The 301st to 1000th characters of the genome sequence data are overwritten with the fragment sequence data (b).
The partial sequence data (A) is combined (concatenated, aligned) data composed of nine fragment sequence data including the fragment sequence data (a).
The genome sequence data includes the partial sequence data (A), (B), and (C).
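Procedures 1 and 2 can be sketched in Python as follows. This is a deliberately naive illustration, not the mapping algorithm of the device: it scans every reference position, scores a fragment by the fraction of identical characters (standing in for the "predetermined ratio" condition), ignores insertions and deletions, and uses an assumed threshold `min_identity` and the placeholder character `-` for undetermined bases. All names are hypothetical.

```python
def best_match(fragment, reference, min_identity=0.9):
    """Procedure 1: return the start index where a non-empty `fragment`
    best matches `reference`, or None if the best identity is too low."""
    best_pos, best_score = None, 0.0
    for start in range(len(reference) - len(fragment) + 1):
        window = reference[start:start + len(fragment)]
        same = sum(a == b for a, b in zip(fragment, window))
        score = same / len(fragment)
        if score > best_score:
            best_pos, best_score = start, score
    return best_pos if best_score >= min_identity else None


def reference_map(fragments, reference, min_identity=0.9):
    """Procedure 2: place each matching fragment at its matched position
    and collect the fragments that match nowhere as leftovers."""
    genome = [None] * len(reference)  # None marks an undetermined base
    leftovers = []
    for fragment in fragments:
        pos = best_match(fragment, reference, min_identity)
        if pos is None:
            leftovers.append(fragment)
            continue
        # A later fragment overwrites earlier ones where they overlap.
        genome[pos:pos + len(fragment)] = fragment
    return "".join(base or "-" for base in genome), leftovers
```

Positions that no fragment covers remain `-` in the returned string, which corresponds to the gaps discussed next, and the second return value holds the leftover sequence data.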

Reference mapping cannot specify the base sequence of portions that do not match the reference sequence data, and therefore cannot specify the base sequence of the entire target genome.
Hereinafter, a portion whose base sequence could not be specified is referred to as a “gap”.

The leftover sequence data from Procedure 1 is considered to be data that should be set in the gap portions.

Reference mapping has the following advantages.
(1) Because reference sequence data is used, partial sequence data can be constructed from a relatively small number of fragment sequence data.
(2) Because the amount of calculation is smaller than for de novo assembly, the computer does not require high processing capability.

Reference mapping has the following disadvantages.
(1) The base sequence data of a related species genome is required as the reference sequence data.
(2) Gaps remain.

FIG. 3 is a schematic diagram of de novo assembly.
An overview of de novo assembly will be described with reference to FIG. 3.

De novo assembly is a method of generating the base sequence data of the target genome (genome sequence data) by assembling a large number of fragment sequence data obtained from multiple copies of the target genome.

De novo assembly is executed by the following procedure.

Procedure 1: One fragment sequence data is selected from the plurality of fragment sequence data.
Hereinafter, the selected fragment sequence data is referred to as “selected sequence data”.

Procedure 2: The end of the selected sequence data is compared with the ends of the other fragment sequence data, and fragment sequence data that contains an end matching the end of the selected sequence data is extracted. The end to be compared is data of a predetermined length at the head or the tail. The matching condition includes conditions other than an exact match.
Hereinafter, the extracted fragment sequence data is referred to as “extracted sequence data”.

Procedure 3: The selected sequence data and the extracted sequence data are joined at the matching portion to generate partial sequence data. Thereafter, the partial sequence data is treated as one piece of fragment sequence data, and the selected sequence data and the extracted sequence data are deleted.

Procedures 1 to 3 are repeated until no combination of fragment sequence data with mutually matching ends remains.
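Procedures 1 to 3 can be sketched as a greedy overlap merge. The following Python example is a minimal illustration under stated assumptions, not the implementation of the device: the function names `overlap_length` and `de_novo_assemble` and the parameter `min_len` (the minimum compared end length) are hypothetical, and only exact suffix-to-prefix matches are used, although the description allows matching conditions other than an exact match.

```python
def overlap_length(left: str, right: str, min_len: int) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def de_novo_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeat procedures 1 to 3 until no two fragments have matching ends."""
    pool = list(fragments)
    merged = True
    while merged:
        merged = False
        for i, selected in enumerate(pool):        # procedure 1: select one fragment
            for j, other in enumerate(pool):       # procedure 2: compare ends
                if i == j:
                    continue
                k = overlap_length(selected, other, min_len)
                if k:
                    joined = selected + other[k:]  # procedure 3: join at the match
                    # Delete the two inputs and treat the result as a fragment.
                    pool = [f for n, f in enumerate(pool) if n not in (i, j)]
                    pool.append(joined)
                    merged = True
                    break
            if merged:
                break
    return pool


# Three overlapping fragments collapse into one partial sequence;
# the unrelated fragment stays separate, which leaves a gap.
print(de_novo_assemble(["ATGCGT", "CGTACC", "ACCTTG", "GGGGGG"]))
```

Each merge shrinks the pool by one fragment, so the loop always terminates. Raising `min_len` makes accidental joins less likely but also leaves more fragments unjoined.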

FIG. 3 shows that partial sequence data (A), (B), and (C) have been generated as the genome sequence data.
For example, partial sequence data (A) is joined (concatenated, aligned) data consisting of nine pieces of fragment sequence data, including fragment sequence data (a).

In de novo assembly, gaps arise between the pieces of partial sequence data, so the base sequence of the entire genome cannot be specified.

De novo assembly has the advantage that reference sequence data is not required, but it has the following disadvantages:
(1) To generate partial sequence data accurately, the length of the ends to be compared must be set long; however, the longer the compared ends are, the less often ends match each other, and the number of gaps increases.
(2) Because the amount of computation is large, a computer with high processing capability is required.
(3) Gaps remain, and in addition it is not known which part of the target genome each piece of partial sequence data represents.

The genome restoration device 100 generates the base sequence data of the target genome (genome sequence data 105) by using the reference mapping and the de novo assembly described above.

FIG. 4 is a flowchart showing the genome restoration method of the genome restoration device 100 according to Embodiment 1.
The genome restoration method of the genome restoration device 100 according to Embodiment 1 will be described with reference to FIG. 4.

In S110 (an example of reference mapping processing), the reference mapping unit 110 inputs a large number of fragment sequence data 191 and the reference sequence data 192 from the sequence data storage unit 190.
It is assumed that the sequence data storage unit 190 stores in advance a large number of fragment sequence data 191 obtained from multiple copies of the target genome and the base sequence data of the genome of a closely related species (reference sequence data 192).

The reference mapping unit 110 performs reference mapping of the large number of fragment sequence data 191 using the reference sequence data 192 (see FIG. 2).
Hereinafter, the genome sequence data generated by the reference mapping is referred to as “genome provisional sequence data 101”.

The genome provisional sequence data 101 represents the base sequence of each specified portion (partial sequence) as a character string consisting of “A”, “T”, “G”, and “C”, and represents each portion whose base sequence was not specified (gap) by a predetermined character string (for example, a run of “0” characters).
Hereinafter, a partial sequence indicated by the genome provisional sequence data 101 is referred to as “mapping partial sequence data 101A”. The genome provisional sequence data 101 includes a plurality of mapping partial sequence data 101A.
After S110, the process proceeds to S120.
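To illustrate this representation, the following minimal Python sketch splits a provisional sequence string into its mapping partial sequences and its gaps. The function name `split_provisional` and the single-character gap marker `"0"` are assumptions made for the example; the description only states that a predetermined character string marks a gap.

```python
import re


def split_provisional(provisional: str, gap_char: str = "0"):
    """Return (mapped, gaps) for a provisional genome string.

    mapped: list of (start, bases) for each run of A/T/G/C.
    gaps:   list of (start, end) half-open index ranges of gap runs.
    """
    gap_pattern = re.compile("(?:" + re.escape(gap_char) + ")+")
    mapped = [(m.start(), m.group()) for m in re.finditer("[ATGC]+", provisional)]
    gaps = [(m.start(), m.end()) for m in gap_pattern.finditer(provisional)]
    return mapped, gaps


mapped, gaps = split_provisional("ATGC000GGTA00CC")
print(mapped)  # [(0, 'ATGC'), (7, 'GGTA'), (13, 'CC')]
print(gaps)    # [(4, 7), (11, 13)]
```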

In S120 (an example of non-mapping fragment data extraction processing), the reference mapping unit 110 identifies, among the large number of fragment sequence data 191, the plurality of fragment sequence data 191 that were not placed in the genome provisional sequence data 101.
Hereinafter, each piece of fragment sequence data 191 identified in S120 is referred to as “leftover sequence data 102”.
After S120, the process proceeds to S130.

In S130 (an example of end sequence data extraction processing), the gap neighborhood sequence extraction unit 120 inputs the genome provisional sequence data 101 generated in S110.
The gap neighborhood sequence extraction unit 120 extracts, from the mapping partial sequence data 101A included in the genome provisional sequence data 101, data of a predetermined length (fragment sequence data 191) placed immediately before and after each gap.
Hereinafter, the data extracted in S130 is referred to as “gap neighborhood sequence data 103”.
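The extraction in S130 can be pictured with a short Python sketch. It is only an illustration under assumptions: gaps are marked with the single character `"0"`, the predetermined length is passed as the hypothetical parameter `flank_len`, and the flanks are cut directly from the provisional string rather than looked up as the original fragment sequence data.

```python
import re


def extract_gap_neighbors(provisional: str, flank_len: int, gap_char: str = "0"):
    """For each gap, return (bases before the gap, bases after the gap)."""
    gap_pattern = re.compile("(?:" + re.escape(gap_char) + ")+")
    neighbors = []
    for gap in gap_pattern.finditer(provisional):
        before = provisional[max(0, gap.start() - flank_len):gap.start()]
        after = provisional[gap.end():gap.end() + flank_len]
        # Do not let a flank reach across a neighboring gap.
        before = before.split(gap_char)[-1]
        after = after.split(gap_char)[0]
        neighbors.append((before, after))
    return neighbors


print(extract_gap_neighbors("ATGCCGT000CTTGAC", flank_len=5))
# [('GCCGT', 'CTTGA')]
```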

Alternatively, during the reference mapping (S110), the reference mapping unit 110 may record identification information of the fragment sequence data 191 placed before and after each gap, and the gap neighborhood sequence extraction unit 120 may input the fragment sequence data 191 identified by the recorded identification information from the sequence data storage unit 190 as the gap neighborhood sequence data 103.

After S130, the process proceeds to S140.

FIG. 5 is a process overview diagram showing an overview of S110 to S130 of the genome restoration method according to Embodiment 1.
An overview of S110 to S130 of the genome restoration method according to Embodiment 1 will be described with reference to FIG. 5.

The reference mapping unit 110 places each piece of fragment sequence data 191 that matches a part of the reference sequence data 192 at the same data position as the matched part, thereby generating the genome provisional sequence data 101 (S110).
The reference mapping unit 110 identifies the plurality of fragment sequence data 191 that were not placed in the genome provisional sequence data 101 (leftover sequence data 102) (S120).
The gap neighborhood sequence extraction unit 120 extracts the plurality of fragment sequence data 191 placed in the genome provisional sequence data 101 as data before and after the gaps (gap neighborhood sequence data 103) (S130).
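S110 and S120 can be sketched together in a few lines of Python. This is a simplified illustration, not the mapping method of the device: the function name `reference_map` is hypothetical, gaps are written as `"0"`, and a fragment is placed only when it occurs verbatim in the reference (`str.find`), whereas mapping against the genome of a related species would normally have to tolerate mismatches.

```python
def reference_map(fragments: list[str], reference: str, gap_char: str = "0"):
    """Return (provisional sequence, leftover fragments).

    Every fragment found in the reference is written at the matching
    position; positions never covered by a fragment stay marked as gaps.
    """
    provisional = [gap_char] * len(reference)
    leftovers = []
    for fragment in fragments:
        position = reference.find(fragment)  # exact match only (assumption)
        if position < 0:
            leftovers.append(fragment)       # S120: not placed anywhere
            continue
        provisional[position:position + len(fragment)] = fragment
    return "".join(provisional), leftovers


provisional, leftovers = reference_map(
    ["ATGCC", "GCCGT", "CTTGAC", "GGGTTT"], "ATGCCGTAAGCTTGAC"
)
print(provisional)  # ATGCCGT000CTTGAC
print(leftovers)    # ['GGGTTT']
```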

Returning to FIG. 4, the description of the genome restoration method is continued.

In S140 (an example of de novo assembly processing), the de novo assembly unit 130 inputs the plurality of leftover sequence data 102 identified in S120 and the plurality of gap neighborhood sequence data 103 extracted in S130.
The de novo assembly unit 130 performs de novo assembly of the plurality of leftover sequence data 102 and the plurality of gap neighborhood sequence data 103 (see FIG. 3).
Hereinafter, the plurality of partial sequence data generated by this de novo assembly are referred to as “assembled partial sequence data 104”.

For example, the de novo assembly unit 130 performs, for each gap, de novo assembly of the gap neighborhood sequence data 103 before and after that gap together with all of the leftover sequence data 102, and generates assembled partial sequence data 104 for each gap.
In this case, the genome restoration device 100 may be configured as a parallel computer having a plurality of processors (CPUs), and the de novo assembly for the respective gaps may be processed in parallel. This shortens the processing time. In addition, because de novo assembly is performed separately for each gap, the assembled partial sequence data 104 of each gap can be generated without being affected by any similarity between the gap neighborhood sequence data 103 of one gap and that of another gap.
Alternatively, the de novo assembly for the gaps may be performed in order of gap data position (or in random order), and leftover sequence data 102 already incorporated into assembled partial sequence data 104 may be excluded from subsequent de novo assemblies to reduce the amount of data and computation.
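The sequential variant, in which leftover data already used for one gap is excluded from later gaps, might look like the following Python sketch. It is an illustration only: a simple greedy extension from the flank before the gap stands in for a full de novo assembler, only exact overlaps of at least `min_len` bases are accepted, and all function and parameter names are hypothetical.

```python
def suffix_prefix_overlap(left: str, right: str, min_len: int) -> int:
    """Longest suffix of `left` equal to a prefix of `right` (0 if < min_len)."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def assemble_gap(before: str, after: str, leftovers: list[str], min_len: int = 3):
    """Extend `before` with leftover reads until `after` can be attached.

    Returns (contig, used reads), or (None, []) if the gap cannot be closed.
    """
    contig, unused, used = before, list(leftovers), []
    while True:
        k = suffix_prefix_overlap(contig, after, min_len)
        if k:
            return contig + after[k:], used
        for read in unused:
            k = suffix_prefix_overlap(contig, read, min_len)
            if 0 < k < len(read):      # the read must add at least one base
                contig += read[k:]
                unused.remove(read)
                used.append(read)
                break
        else:
            return None, []            # no read extends the contig


def assemble_gaps_in_order(gap_flanks, leftovers, min_len: int = 3):
    """Process gaps one by one, dropping reads consumed by earlier gaps."""
    remaining = list(leftovers)
    contigs = []
    for before, after in gap_flanks:
        contig, used = assemble_gap(before, after, remaining, min_len)
        contigs.append(contig)
        for read in used:
            remaining.remove(read)
    return contigs


flanks = [("ATGCC", "TTGAC"), ("CCCAA", "GGTTT")]
reads = ["TAAGCTTG", "GCCGTAA", "CAAGAGGT"]
print(assemble_gaps_in_order(flanks, reads))
# ['ATGCCGTAAGCTTGAC', 'CCCAAGAGGTTT']
```

When every gap is instead given the full set of leftover reads, the calls to `assemble_gap` do not depend on each other, which is what makes the parallel configuration described above possible.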

The de novo assembly unit 130 may instead perform de novo assembly of all the gap neighborhood sequence data 103 and all the leftover sequence data 102 together, generating the plurality of assembled partial sequence data 104 in a single de novo assembly.

After S140, the process proceeds to S150.

FIG. 6 is a first process overview diagram showing an overview of S140 of the genome restoration method according to Embodiment 1.
FIG. 7 is a second process overview diagram showing an overview of S140 of the genome restoration method according to Embodiment 1.
As an overview of S140 of the genome restoration method according to Embodiment 1, the de novo assembly performed for each gap will be described with reference to FIG. 6, and the de novo assembly performed for all gaps together will be described with reference to FIG. 7.

In FIG. 6, the de novo assembly unit 130 performs a first small-scale de novo assembly using the gap neighborhood sequence data 103 of the first gap and all the leftover sequence data 102, and generates the assembled partial sequence data 104 corresponding to the first gap.
The de novo assembly unit 130 further performs a second small-scale de novo assembly using the gap neighborhood sequence data 103 of the second gap and all the leftover sequence data 102, and generates the assembled partial sequence data 104 corresponding to the second gap.

Small-scale de novo assembly means de novo assembly performed using the fragment sequence data 191 left unmapped by the reference mapping (S110), that is, the leftover sequence data 102.

In FIG. 7, the de novo assembly unit 130 performs small-scale de novo assembly using the gap neighborhood sequence data 103 of the first gap, the gap neighborhood sequence data 103 of the second gap, and all the leftover sequence data 102.
As a result, the assembled partial sequence data 104 corresponding to the first gap and the assembled partial sequence data 104 corresponding to the second gap are generated. The data that contains the gap neighborhood sequence data 103 of the X-th gap is the assembled partial sequence data 104 corresponding to the X-th gap.
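The rule for matching batch-assembled data to gaps can be shown with a small Python sketch. The function name `assign_contigs_to_gaps` and the representation of gap neighborhood data as `(before, after)` string pairs are assumptions for the example; a contig is assigned to a gap when it contains both flanks of that gap.

```python
def assign_contigs_to_gaps(contigs: list[str], gap_flanks: list[tuple[str, str]]):
    """Map gap index -> first contig that contains both flanks of that gap."""
    assignment = {}
    for index, (before, after) in enumerate(gap_flanks):
        for contig in contigs:
            if before in contig and after in contig:
                assignment[index] = contig
                break
    return assignment


contigs = ["CCCAAGAGGTTT", "GGGGGG", "ATGCCGTAAGCTTGAC"]
flanks = [("ATGCC", "TTGAC"), ("CCCAA", "GGTTT")]
print(assign_contigs_to_gaps(contigs, flanks))
# {0: 'ATGCCGTAAGCTTGAC', 1: 'CCCAAGAGGTTT'}
```

Gaps whose flanks appear in no contig are simply absent from the result, so a caller can tell which gaps remain open.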

In S150 (an example of genome sequence data generation processing), the complete genome restoration unit 140 inputs the genome provisional sequence data 101 generated in S110 and the plurality of assembled partial sequence data 104 generated in S140.
The complete genome restoration unit 140 places, in each gap of the genome provisional sequence data 101, the assembled partial sequence data 104 corresponding to that gap.
Hereinafter, the data obtained by placing the assembled partial sequence data 104 in the genome provisional sequence data 101 is referred to as “genome sequence data 105”.

Because the genome sequence data 105 is data in which the gaps of the genome provisional sequence data 101 have been filled with the assembled partial sequence data 104, it can represent the base sequence of the entire target genome.
With S150, the processing of the genome restoration method ends.

FIG. 8 is a process overview diagram showing an overview of S150 of the genome restoration method according to Embodiment 1.
An overview of S150 of the genome restoration method according to Embodiment 1 will be described with reference to FIG. 8.

The genome provisional sequence data 101 includes a plurality of mapping partial sequence data 101A.
The complete genome restoration unit 140 places the assembled partial sequence data 104 (excluding the portions that are gap neighborhood sequence data 103) in the gaps between the mapping partial sequence data 101A to generate the genome sequence data 105.
In other words, the genome sequence data 105 is data in which the plurality of mapping partial sequence data 101A and the plurality of assembled partial sequence data 104 are joined so that the gap neighborhood sequence data 103 overlap.
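The gap filling of S150 can be sketched as follows. This Python example is a minimal illustration under assumptions: gaps are runs of `"0"`, the contigs are supplied in gap order, and each contig begins with the `flank_len` bases before its gap and ends with the `flank_len` bases after it. The names `fill_gaps` and `flank_len` are hypothetical.

```python
import re


def fill_gaps(provisional: str, contigs: list, flank_len: int, gap_char: str = "0") -> str:
    """Replace each gap with the interior of its contig, overlapping the flanks."""
    gap_pattern = re.compile("(?:" + re.escape(gap_char) + ")+")
    pieces, cursor = [], 0
    for gap, contig in zip(gap_pattern.finditer(provisional), contigs):
        before = provisional[max(0, gap.start() - flank_len):gap.start()]
        after = provisional[gap.end():gap.end() + flank_len]
        pieces.append(provisional[cursor:gap.start()])  # mapped part before the gap
        fits = (
            contig is not None
            and len(contig) >= len(before) + len(after)
            and contig.startswith(before)
            and contig.endswith(after)
        )
        if fits:
            # Keep only the part of the contig between the two flanks.
            pieces.append(contig[len(before):len(contig) - len(after)])
        else:
            pieces.append(gap.group())                  # leave the gap unresolved
        cursor = gap.end()
    pieces.append(provisional[cursor:])
    return "".join(pieces)


print(fill_gaps("ATGCC00000TTGAC", ["ATGCCGTAAGCTTGAC"], flank_len=5))
# ATGCCGTAAGCTTGAC
```

The result is built by concatenation rather than in-place overwrite because the filled sequence need not have the same length as the gap marked in the provisional data; in the example, five gap characters are replaced by six bases.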

FIG. 9 is a diagram showing an example of the hardware resources of the genome restoration device 100 according to Embodiment 1.
In FIG. 9, the genome restoration device 100 includes a CPU 911 (Central Processing Unit; also called a microprocessor or microcomputer). The CPU 911 is connected via a bus 912 to a ROM 913, a RAM 914, a communication board 915, a display device 901, a keyboard 902, a mouse 903, a drive device 904, and a magnetic disk device 920, and controls these hardware devices. The drive device 904 is a device that reads from and writes to storage media such as an FD (Flexible Disk Drive), a CD (Compact Disc), and a DVD (Digital Versatile Disc).

The communication board 915 is connected, by wire or wirelessly, to a communication network such as a LAN (Local Area Network), the Internet, or a telephone line.

The magnetic disk device 920 stores an OS 921 (operating system), a window system 922, a program group 923, and a file group 924.

The program group 923 includes programs that execute the functions described as “... unit” in the embodiment. The programs are read and executed by the CPU 911. That is, the programs cause a computer to function as each “... unit” and cause the computer to execute the procedure or method of each “... unit”.

The file group 924 includes the various data (inputs, outputs, determination results, calculation results, processing results, and the like) used by each “... unit” described in the embodiment.

In the embodiment, the arrows in the configuration diagrams and flowcharts mainly indicate the input and output of data and signals.

In the embodiment, what is described as a “... unit” may be a “... circuit”, “... apparatus”, or “... device”, and may also be a “... step”, “... procedure”, or “... process”. That is, what is described as a “... unit” may be implemented in firmware, software, hardware, or any combination of these.

Embodiment 1 has described a genome restoration device, method, and program that specify the base sequence of a target genome by combining reference mapping and de novo assembly.
According to Embodiment 1, the base sequence of the gap portions of the target genome that could not be specified by reference mapping can be specified.
In addition, by performing de novo assembly only on the fragment sequence data left over from the reference mapping (leftover sequence data), the amount of computation is reduced compared with de novo assembly of all the fragment sequence data, so the base sequence of the target genome can be specified using a computer with relatively low processing capability.

100 genome restoration device, 101 genome provisional sequence data, 101A mapping partial sequence data, 102 leftover sequence data, 103 gap neighborhood sequence data, 104 assembled partial sequence data, 105 genome sequence data, 110 reference mapping unit, 120 gap neighborhood sequence extraction unit, 130 de novo assembly unit, 140 complete genome restoration unit, 190 sequence data storage unit, 191 fragment sequence data, 192 reference sequence data, 901 display device, 902 keyboard, 903 mouse, 904 drive device, 911 CPU, 912 bus, 913 ROM, 914 RAM, 915 communication board, 920 magnetic disk device, 921 OS, 922 window system, 923 program group, 924 file group.

Claims

1. A genome sequence specifying device that specifies the base sequence of a target genome, comprising:
a reference mapping unit that inputs a plurality of fragment sequence data indicating fragments of the base sequence of the target genome, inputs reference sequence data indicating the base sequence of a known genome whose base sequence has been specified, compares the plurality of fragment sequence data with the reference sequence data, and, based on the comparison result, generates, as mapping partial sequence data, data in which a plurality of fragment sequence data corresponding to the reference sequence data are joined;
a non-mapping fragment data extraction unit that extracts, from the plurality of fragment sequence data, a plurality of fragment sequence data not included in the mapping partial sequence data generated by the reference mapping unit, as a plurality of non-mapping fragment data;
an end sequence data extraction unit that extracts, as end sequence data, fragment sequence data included in an end of the mapping partial sequence data generated by the reference mapping unit;
a de novo assembly unit that compares the end sequence data extracted by the end sequence data extraction unit with the plurality of non-mapping fragment data extracted by the non-mapping fragment data extraction unit and, based on the comparison result, generates, as assembled partial sequence data, data in which the end sequence data and at least one of the non-mapping fragment data are joined at a matching portion; and
a genome sequence data generation unit that generates, as genome sequence data indicating the base sequence of the target genome, data in which the mapping partial sequence data generated by the reference mapping unit and the assembled partial sequence data generated by the de novo assembly unit are joined at the portion indicating the end sequence data.

2. The genome sequence specifying device according to claim 1, wherein
the reference mapping unit generates a plurality of mapping partial sequence data,
the end sequence data extraction unit extracts a plurality of end sequence data from the plurality of mapping partial sequence data,
the de novo assembly unit compares the plurality of end sequence data with the plurality of non-mapping fragment data and generates a plurality of assembled partial sequence data based on the comparison result, and
the genome sequence data generation unit generates the genome sequence data by joining the plurality of mapping partial sequence data and the plurality of assembled partial sequence data.

3. The genome sequence specifying device according to claim 2, wherein
the end sequence data extraction unit identifies, as a gap, each portion of the reference sequence data that does not correspond to any mapping partial sequence data and, for each identified gap, extracts end sequence data from the mapping partial sequence data before and after the gap,
the de novo assembly unit compares the plurality of end sequence data with the plurality of non-mapping fragment data and generates, based on the comparison result, a plurality of assembled partial sequence data corresponding to the plurality of gaps, and
the genome sequence data generation unit generates the genome sequence data by joining the plurality of mapping partial sequence data and the plurality of assembled partial sequence data.

4. The genome sequence specifying device according to claim 3, wherein
the de novo assembly unit compares, for each gap, the end sequence data before and after the gap with the plurality of non-mapping fragment data and generates assembled partial sequence data for each gap based on the comparison result, and
the genome sequence data generation unit generates the genome sequence data by joining, for each gap, the mapping partial sequence data before and after the gap and the assembled partial sequence data corresponding to the gap.

5. A genome sequence specifying program that specifies the base sequence of a target genome, the program causing a computer to execute:
reference mapping processing that inputs a plurality of fragment sequence data indicating fragments of the base sequence of the target genome, inputs reference sequence data indicating the base sequence of a known genome whose base sequence has been specified, compares the plurality of fragment sequence data with the reference sequence data, and, based on the comparison result, generates, as mapping partial sequence data, data in which a plurality of fragment sequence data corresponding to the reference sequence data are joined;
non-mapping fragment data extraction processing that extracts, from the plurality of fragment sequence data, a plurality of fragment sequence data not included in the mapping partial sequence data generated by the reference mapping processing, as a plurality of non-mapping fragment data;
end sequence data extraction processing that extracts, as end sequence data, fragment sequence data included in an end of the mapping partial sequence data generated by the reference mapping processing;
de novo assembly processing that compares the end sequence data extracted by the end sequence data extraction processing with the plurality of non-mapping fragment data extracted by the non-mapping fragment data extraction processing and, based on the comparison result, generates, as assembled partial sequence data, data in which the end sequence data and at least one of the non-mapping fragment data are joined at a matching portion; and
genome sequence data generation processing that generates, as genome sequence data indicating the base sequence of the target genome, data in which the mapping partial sequence data generated by the reference mapping processing and the assembled partial sequence data generated by the de novo assembly processing are joined at the portion indicating the end sequence data.

6. The genome sequence specifying program according to claim 5, wherein
the reference mapping processing generates a plurality of mapping partial sequence data,
the end sequence data extraction processing extracts a plurality of end sequence data from the plurality of mapping partial sequence data,
the de novo assembly processing compares the plurality of end sequence data with the plurality of non-mapping fragment data and generates a plurality of assembled partial sequence data based on the comparison result, and
the genome sequence data generation processing generates the genome sequence data by joining the plurality of mapping partial sequence data and the plurality of assembled partial sequence data.

7. The genome sequence specifying program according to claim 6, wherein
the end sequence data extraction processing identifies, as a gap, each portion of the reference sequence data that does not correspond to any mapping partial sequence data and, for each identified gap, extracts end sequence data from the mapping partial sequence data before and after the gap,
the de novo assembly processing compares the plurality of end sequence data with the plurality of non-mapping fragment data and generates, based on the comparison result, a plurality of assembled partial sequence data corresponding to the plurality of gaps, and
the genome sequence data generation processing generates the genome sequence data by joining the plurality of mapping partial sequence data and the plurality of assembled partial sequence data.

8. The genome sequence specifying program according to claim 7, wherein
the de novo assembly processing compares, for each gap, the end sequence data before and after the gap with the plurality of non-mapping fragment data and generates assembled partial sequence data for each gap based on the comparison result, and
the genome sequence data generation processing generates the genome sequence data by joining, for each gap, the mapping partial sequence data before and after the gap and the assembled partial sequence data corresponding to the gap.

9. A genome sequence specifying method of a genome sequence specifying device that specifies the base sequence of a target genome, wherein
a reference mapping unit inputs a plurality of fragment sequence data indicating fragments of the base sequence of the target genome, inputs reference sequence data indicating the base sequence of a known genome whose base sequence has been specified, compares the plurality of fragment sequence data with the reference sequence data, and, based on the comparison result, generates, as mapping partial sequence data, data in which a plurality of fragment sequence data corresponding to the reference sequence data are joined,
a non-mapping fragment data extraction unit extracts, from the plurality of fragment sequence data, a plurality of fragment sequence data not included in the mapping partial sequence data generated by the reference mapping unit, as a plurality of non-mapping fragment data,
an end sequence data extraction unit extracts, as end sequence data, fragment sequence data included in an end of the mapping partial sequence data generated by the reference mapping unit,
a de novo assembly unit compares the end sequence data extracted by the end sequence data extraction unit with the plurality of non-mapping fragment data extracted by the non-mapping fragment data extraction unit and, based on the comparison result, generates, as assembled partial sequence data, data in which the end sequence data and at least one of the non-mapping fragment data are joined at a matching portion, and
a genome sequence data generation unit generates, as genome sequence data indicating the base sequence of the target genome, data in which the mapping partial sequence data generated by the reference mapping unit and the assembled partial sequence data generated by the de novo assembly unit are joined at the portion indicating the end sequence data.