JP7089804B2

JP7089804B2 - Data creation device, data creation method, and storage medium storing a data creation program

Info

Publication number: JP7089804B2
Application number: JP2020508869A
Authority: JP
Inventors: 竜仲木; 仙太郎與島; 真輝人小林; 大騎村上
Original assignee: Rhelixa
Current assignee: Rhelixa
Priority date: 2018-03-30
Filing date: 2018-03-30
Publication date: 2022-06-23
Anticipated expiration: 2038-03-30
Also published as: JPWO2019187100A1; WO2019187100A1

Description

The present invention relates to a data creation device and a data creation method for creating data from which the read frequency of a genome sequencer can be reproduced, and to a storage medium storing a data creation program.

Genomic information of living organisms is expected to be put to a wide range of uses.

For example, genomic information of humans or animals is expected to be used to analyze genetic predisposition, to predict the onset of disease, and to track the progression of illness. Genomic information of plants or microorganisms is likewise expected to be used to optimize soil, water, or products.

Making use of genomic information in this way requires collecting a large amount of it. In general, however, data representing genomic information tends to be very large. For example, the set of sequence data needed to reproduce a human genome amounts to several hundred gigabytes.

Consequently, storing or transmitting all genomic information as is may strain the storage capacity of a database or congest the communication line.

Reducing the data volume of genomic information is therefore an important issue.

Patent Document 1 proposes a technique that compares reference genome data with each individual's genome data and stores and transmits only the base information that differs between the two, thereby compressing the data to about 0.1% of the volume of ordinary genome data.

Patent Document 1: International Publication No. 2015/146852

However, the technique of Patent Document 1 only reproduces the sequence of base symbols (the sequence of A, C, G, and T) in each individual's genome data. That is, it cannot reproduce information other than the base symbols, such as the frequency with which the genome sequencer read each piece of base information.

In general, when a genome sequencer reads the genomic information of a target, a single read does not cover the entire genome (about 3.1 billion base pairs for a human) but only a portion of it (hereinafter referred to as a "read" where appropriate). The base sequence contained in a single read is, for example, about 50 base pairs long.

The genome sequencer is configured to keep reading reads until the base sequences contained in the reads are sufficient to reconstruct the entire genome.

The genome sequencer does not necessarily read uniformly across the entire genome; it may read some locations frequently and others infrequently. As a result, the read frequency can vary from one base sequence to another.

The read frequency of a genome sequencer is a useful index for identifying sites of molecular modification of the genome and sites where interacting proteins bind, and for judging their statistical significance. Analyzing the variation in read frequency may therefore yield information beyond the reproduced sequence of base symbols.

As noted above, however, the technique of Patent Document 1 cannot reproduce the read frequency of the genome sequencer.

An object of the present invention is therefore to provide a data creation device, a data creation method, and a storage medium storing a data creation program that create data from which the read frequency of a genome sequencer can be reproduced while keeping the data volume small.

The data creation device of the present invention comprises: a first base sequence storage unit that stores first base sequence data whose length is a first number of bases; a position recognition unit that, based on the first base sequence data stored in the first base sequence storage unit, recognizes, for each piece of second base sequence data whose individual length is a second number of bases shorter than the first number of bases, a numerical value indicating the position of the partial sequence in the first base sequence data that corresponds to that second base sequence data; a sorting unit that creates a position array by sorting, in ascending or descending order, the numerical values indicating the positions of the partial sequences of the first base sequence data corresponding to the respective second base sequence data; a reference element recognition unit that recognizes a reference element, which is at least one element included in the position array; a difference recognition unit that recognizes the differences between adjacent elements included in the position array; and a data creation unit that creates data including the reference element recognized by the reference element recognition unit and the differences between elements recognized by the difference recognition unit.

According to the data creation device of this configuration, the position recognition unit recognizes, based on the first base sequence data stored in the first base sequence storage unit and for each piece of second base sequence data whose individual length is the second number of bases (shorter than the first number of bases), a numerical value indicating the position of the corresponding partial sequence in the first base sequence data.

The sorting unit then sorts these numerical values in ascending or descending order to create an array of the positions of the partial sequences of the first base sequence data corresponding to the respective second base sequence data. Because adjacent elements of this array are arranged in ascending or descending order, the differences between them tend to be quite small. In particular, for base sequence data derived from base sequences that the genome sequencer read frequently, the numerical values indicating their positions are identical or differ only slightly.

The reference element recognition unit then recognizes a reference element, which is at least one element included in the position array.

The difference recognition unit then recognizes the differences between adjacent elements included in the position array.

The data creation unit then creates data including the reference element recognized by the reference element recognition unit and the differences between elements recognized by the difference recognition unit.

As described above, the differences between elements for base sequences that the genome sequencer read frequently tend to be quite small, so the volume of the data representing the differences between elements can be kept small.

Meanwhile, the numerical value indicating the position of the partial sequence of the first base sequence data corresponding to each piece of second base sequence data can be recovered by working backward from the reference element and the differences included in the created data. These numerical values show which parts of the base sequence of the target genome data were read and how frequently.
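The sort, reference element, difference, and back-calculation steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the disclosed implementation: the function names `encode_positions` and `decode_positions` and the sample positions are hypothetical, ascending order is chosen, and the smallest element is taken as the reference element.

```python
from collections import Counter
from itertools import accumulate


def encode_positions(positions):
    """Sort the positions and return (reference element, adjacent differences)."""
    if not positions:
        raise ValueError("at least one position is required")
    ordered = sorted(positions)  # ascending order (assumption; descending also works)
    reference = ordered[0]       # smallest value used as the reference element
    diffs = [b - a for a, b in zip(ordered, ordered[1:])]
    return reference, diffs


def decode_positions(reference, diffs):
    """Rebuild the sorted position array by running sums over the differences."""
    return list(accumulate(diffs, initial=reference))


# Hypothetical start positions of seven reads on one reference sequence.
positions = [1042, 17, 1042, 1040, 17, 1042, 980]
reference, diffs = encode_positions(positions)
restored = decode_positions(reference, diffs)
depth = Counter(restored)  # number of reads starting at each position
```

In this toy input, reads that share a start position produce differences of zero, and the restored array yields the per-position read count directly.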

As described above, the data creation device of the present invention can create data from which the read frequency of a genome sequencer can be reproduced while keeping the data volume small.

In the data creation device of the present invention, the reference element recognition unit is preferably configured to recognize, as the reference element, the element having the smallest value among the elements included in the position array.

According to the data creation device of this configuration, the reference element recognition unit recognizes the element having the smallest value among the elements of the position array as the reference element. This keeps the volume of the data representing the reference element small, so the compression ratio can be improved further.

In the data creation device of the present invention, the data creation unit preferably creates, as the data representing the differences between elements, variable-length data including one or more pairs of a first part, which indicates whether the preceding or following data is related data, and a second part, which stores data of 14 bits or less.

According to the data creation device of this configuration, the data creation unit creates, as the data representing the differences between elements, variable-length data including one or more pairs of a first part, which indicates whether the preceding or following data is related data, and a second part, which stores data of 14 bits or less.

According to the applicant's study, nearly all differences between consecutive elements of the position array can be represented in 14 bits or less. This makes it possible to represent the differences between elements for most of the data while keeping the data volume small.

Furthermore, because the first part, which indicates whether the preceding or following data is related data, shows that an appropriate number of second parts are used together as the data representing a difference between elements, even a difference that needs more bits than a single second part provides can be represented by the variable-length data.
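One way to picture such variable-length data is the Python sketch below. It is a sketch under assumptions that the text does not fix: the first part is taken to be a single flag bit where 1 means "the following unit belongs to the same difference", the second part is `payload_bits` wide, payload chunks are emitted least significant first, and the output is shown as a string of "0" and "1" characters for readability. The names `encode_diff` and `decode_diffs` are hypothetical.

```python
def encode_diff(value, payload_bits=3):
    """Encode one non-negative difference as units of 1 flag bit + payload_bits data bits."""
    if value < 0:
        raise ValueError("differences of a sorted array are non-negative")
    mask = (1 << payload_bits) - 1
    units = []
    while True:
        chunk = value & mask          # lowest payload_bits bits of the remaining value
        value >>= payload_bits
        flag = 1 if value else 0      # 1: another unit of this difference follows
        units.append(f"{flag}{chunk:0{payload_bits}b}")
        if not value:
            return "".join(units)


def decode_diffs(bits, payload_bits=3):
    """Decode a concatenation of encoded differences back into a list of integers."""
    values, current, shift = [], 0, 0
    step = payload_bits + 1
    for i in range(0, len(bits), step):
        flag = bits[i]
        chunk = int(bits[i + 1:i + step], 2)
        current |= chunk << shift
        if flag == "1":
            shift += payload_bits     # keep accumulating into the same difference
        else:
            values.append(current)
            current, shift = 0, 0
    return values


diffs = [0, 963, 60, 2, 0, 0]
stream = "".join(encode_diff(d) for d in diffs)
```

With a 3-bit second part, a difference of 2 occupies one 4-bit unit, while 963 spills over into four units; the flag bit tells the decoder where each difference ends.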

In the data creation device of this configuration, the second part is preferably 6 bits or less.

According to the applicant's study, about 80% of the differences between elements can be represented in 6 bits or less.

Representing the data of the second part in 6 bits or less makes it possible to reduce the data volume further for most of the data. At the same time, because the first part, which indicates whether the preceding or following data is related data, shows that an appropriate number of second parts are used together as the data representing a difference between elements, even a difference that needs more bits than a single second part provides can be represented by the variable-length data.

In the data creation device of this configuration, the second part is preferably 3 bits or less.

According to the applicant's study, about 60% of the data can be represented in 3 bits or less. Representing the data of the second part in 3 bits or less makes it possible to reduce the data volume further for most of the data. At the same time, because the first part, which indicates whether the preceding or following data is related data, shows that an appropriate number of second parts are used together as the data representing a difference between elements, even a difference that needs more bits than a single second part provides can be represented by the variable-length data.
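The kind of measurement behind these percentages can be sketched as below. The helper `share_within_bits` is hypothetical and the sample differences are toy values, not the applicant's data; the sketch only shows how the share of differences that fit in a given number of bits might be computed.

```python
def share_within_bits(diffs, max_bits):
    """Return the fraction of differences representable in at most max_bits bits."""
    if not diffs:
        raise ValueError("at least one difference is required")
    # int.bit_length() is 0 for the value 0, so a zero difference always fits.
    fits = sum(1 for d in diffs if d.bit_length() <= max_bits)
    return fits / len(diffs)


toy_diffs = [0, 963, 60, 2, 0, 0]  # illustrative values only
share_3 = share_within_bits(toy_diffs, 3)
share_6 = share_within_bits(toy_diffs, 6)
share_14 = share_within_bits(toy_diffs, 14)
```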

Overall configuration diagram of the data creation system.
Diagram showing an example of the first base sequence data.
Diagram showing an example of a plurality of pieces of second base sequence data read by the genome sequencer.
Flowchart of the data creation processing.
Diagram showing an example of a SAM format file.
Diagram showing an example of the data after extraction.
Diagram showing an example of the data after sorting.
Diagram showing an example of the data after difference recognition.
Diagram showing an example of the contents of the data created by the data creation processing.
Diagram showing a specific example of the data created by the data creation processing.
Diagram showing an example of the format of the data created by the data creation processing.
Diagram showing how data is represented according to one data format.
Graph showing the relationship between the number of bits required to represent a difference and the frequency and cumulative share of each number of bits.

The data creation system according to an embodiment of the present invention will be described with reference to FIGS. 1 to 8.

(Configuration of data creation system)
The configuration of the data creation system will be described with reference to FIG. 1.

The data creation system includes one or more genome sequencers 100, one or more data creation devices 200, and a database 300.

Each of the one or more data creation devices 200 is connected to a corresponding one of the one or more genome sequencers 100 via a wired or wireless connection. The database 300 is connected to each of the data creation devices 200 via a wide area network such as the Internet. The one or more data creation devices 200 may each be used by a different user.

(Configuration of genome sequencer)
The genome sequencer 100 is configured, for example, to acquire part of the genomic information from a target living body P and to repeatedly output data representing a partial base sequence included in that genomic information (hereinafter referred to as "second base sequence data"). The genome sequencer 100 is, for example, a HiSeq system (registered trademark). The second base sequence data is represented as a series of base symbols (A, C, G, or T). The genome sequencer 100 reads the second base sequence data so that it contains the number of base symbols given by a predetermined setting or specified by the user. Hereinafter, the number of base symbols included in the second base sequence data is also referred to as the "length of the second base sequence data" where appropriate. The second base sequence data may include symbols other than base symbols, for example "?" as a symbol indicating an unreadable base. The length of the second base sequence data corresponds to an example of the "second number of bases" of the present invention.

(Configuration of data creation device)
Although the one or more data creation devices 200 differ in detail from terminal to terminal, they generally have the following configuration.

The data creation device 200 includes an arithmetic processing unit 210 and a storage unit 220.

The data creation device 200 may be a computer whose size, shape, and weight are designed so that a user can carry it, such as a laptop computer, a tablet terminal, or a smartphone, or a computer whose size, shape, and weight are designed for installation at a fixed location, such as a desktop computer.

The arithmetic processing unit 210 includes an arithmetic processing device such as a CPU (Central Processing Unit), a storage device such as a memory, and an I/O (Input/Output) device. A data creation program 223 downloaded from outside is installed in the storage unit 220. When the data creation program 223 stored in the storage unit 220 is started, the arithmetic processing unit 210 functions as a position recognition unit 211, a sorting unit 212, a reference element recognition unit 213, a difference recognition unit 214, and a data creation unit 215. The data creation device 200 storing the data creation program 223 corresponds to an example of the "storage medium" of the present invention.

The arithmetic processing unit 210 is configured to communicate with external devices such as the database 300 via wired communication, or via wireless communication conforming to a communication standard suited to long-range wireless communication, such as WiFi (registered trademark).

The storage unit 220 includes storage devices such as a ROM (Read Only Memory), a RAM (Random Access Memory), and an HDD (Hard Disk Drive).

The storage unit 220 is configured to store information recognized by the arithmetic processing unit 210, such as the results of arithmetic processing performed by the arithmetic processing unit 210 and data it has received.

Note that a device "recognizing" information means that any arithmetic processing for acquiring that information is executed, for example: the device receiving the information from another device; the device reading information stored in a storage medium connected to it; the device acquiring information based on a signal output from a sensor connected to it; the device deriving the information by executing predetermined arithmetic processing (such as calculation or search) on received information, information stored in a storage medium, or information acquired from a sensor; the device receiving the information from another device as the result of arithmetic processing by that other device; or the device reading the information from an internal or external storage device in accordance with a received signal.

The storage unit 220 includes a first base sequence storage unit 221 and a data storage unit 222.

As shown in FIG. 2A, the first base sequence storage unit 221 stores data representing a base sequence (hereinafter referred to as "first base sequence data"). This data can be created from data representing the base sequences read from one or more living bodies (which, however, share a certain degree of commonality, such as "humans" or "Japanese people"). When it is created from the base sequence data of a plurality of living bodies, the first base sequence data represents bases that are common to the base sequences read from those living bodies with the base symbols themselves, and represents bases that differ among them with a symbol other than a base symbol, such as "*". A single piece of first base sequence data may be split into a plurality of base sequences, such as chr1 and chr2, and stored per base sequence. Each of the base sequences obtained by this split is hereinafter referred to as a "reference sequence" where appropriate, and the character string identifying each of them, such as chr1 or chr2, as the "name of the reference sequence". In this embodiment, the name of a reference sequence consists of a predetermined character string such as chr and a number. The length of each reference sequence is set longer than the length of the second base sequence data. The total length of the reference sequences corresponds to an example of the "first number of bases" of the present invention.

The first base sequence storage unit 221 may store first base sequence data for each type of living body.

The living body used as a sample for creating the first base sequence data is different from the living body P that is the target of the data creation processing described later. However, as long as the type of living body is the same, most of the base sequence matches even between different individuals. For humans, for example, about 99.9% of the base sequence matches between individuals.

(Configuration of database)
The database 300 includes an arithmetic processing device such as a CPU, storage devices such as a local memory, a ROM, a RAM, and an HDD, and an I/O device. The database 300 is configured to store the data received from the data creation devices 200. The database 300 may consist of a single processor or of a plurality of processors capable of communicating with one another.

Some or all of the computers constituting the database 300 may be computers that constitute the data creation devices 200. For example, part or all of the database 300 may be constituted by one or more data creation devices 200 acting as mobile stations.

The database 300 is connected to a public communication network (for example, the Internet) via WiFi or a wired connection, and is configured to communicate with external devices (for example, the data creation device 200).

(Data creation processing)
Next, the flow of the data creation processing executed by the data creation device 200 will be described with reference to FIGS. 2 to 8.

The position recognition unit 211 recognizes each piece of second base sequence data of the target living body P based on the data output from the genome sequencer 100 (FIG. 3 / STEP002). The target living body P may be any living body whose genomic information can be read by the genome sequencer 100; it may be, for example, a human, an animal, a plant, or a microorganism.

The data output from the genome sequencer 100 is, for example, data D1 consisting of series of base symbols, as shown in FIG. 2B.

The data D1 includes a plurality of pieces of second base sequence data D11, D12, and D13, each represented as a series of a predetermined number (for example, 50) of base symbols. The pieces of second base sequence data D11, D12, and D13 are separated, for example, by commas. Each of them also includes an auxiliary base symbol D111, D121, or D131 indicating a base that could not be read.
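As a rough illustration of data of this shape, the Python sketch below splits a comma-separated string into individual reads and checks the symbols. The function name `split_reads`, the use of "?" as the unreadable-base symbol, and the sample string are assumptions made for the example; real sequencer output formats differ.

```python
VALID_SYMBOLS = set("ACGT?")  # base symbols plus "?" for an unreadable base (assumption)


def split_reads(raw, expected_length=None):
    """Split comma-separated sequencer output into reads and validate each one."""
    reads = [part.strip() for part in raw.split(",") if part.strip()]
    for read in reads:
        unknown = set(read) - VALID_SYMBOLS
        if unknown:
            raise ValueError(f"unexpected symbols {sorted(unknown)} in read {read!r}")
        if expected_length is not None and len(read) != expected_length:
            raise ValueError(f"read {read!r} does not have length {expected_length}")
    return reads


# Hypothetical output with three reads of length 8, each containing one unreadable base.
raw_output = "ACG?TTGA,GGC?ATTC,TTA?CCGA"
reads = split_reads(raw_output, expected_length=8)
```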

The position recognition unit 211 compares each piece of second base sequence data recognized in FIG. 3 / STEP002 with the first base sequence data stored in the first base sequence storage unit 221, and thereby recognizes a numerical value indicating the position, within the first base sequence data, of the partial sequence corresponding to each piece of second base sequence data (FIG. 3 / STEP004).

For example, the position recognition unit 211 recognizes the partial sequence of the first base sequence data in which the order of appearance of the base symbols matches that of the second base sequence data at the highest rate (the partial sequence of the first base sequence data corresponding to that second base sequence data). The position recognition unit 211 then recognizes a numerical value indicating the start position of that partial sequence in the first base sequence data. The position of the partial sequence may be any position that identifies the partial sequence; it is not limited to the start position and may be, for example, the end position or some other position.

Various known methods can be used to recognize such a numerical value indicating the position of a partial sequence.
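The idea of choosing the partial sequence with the highest match rate can be illustrated with the naive Python sketch below. It is a brute-force stand-in for the known methods mentioned above, not one of them: practical aligners use indexed search. The function name `best_match_position`, the 1-based start position, and the sample sequences are assumptions.

```python
def best_match_position(reference, read):
    """Return (1-based start position, match rate) of the best-matching window."""
    if not read or len(read) > len(reference):
        raise ValueError("read must be non-empty and no longer than the reference")
    best_start, best_matches = 0, -1
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        matches = sum(1 for r, q in zip(window, read) if r == q)
        if matches > best_matches:  # ties keep the earliest window
            best_start, best_matches = start, matches
    return best_start + 1, best_matches / len(read)


# Hypothetical reference sequence and a read with one unreadable base ("?").
position, rate = best_match_position("ACGTTGCAACGGT", "TGC?A")
```

Here the read matches four of five symbols against the window starting at the fifth base, so that window is selected even though the unreadable base never matches.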

The position recognition unit 211 creates a file in SAM (Sequence Alignment/Map) format (FIG. 3 / STEP006). The created file is stored in the storage unit 220.

FIG. 4 shows an example of the file created in FIG. 3 / STEP006. The file shown in FIG. 4 includes header data D21 and body data D22.

For each piece of second base sequence data, the body data D22 includes the name D221 of the reference sequence, the start position D222, within the reference sequence, of the partial sequence of the first base sequence data corresponding to the second base sequence data, the start position D223 of the partial sequence in the first base sequence data for the paired-end case, and the base sequence D224 of the second base sequence data. The name of the reference sequence together with the start position of the partial sequence in the reference sequence corresponds to an example of the "position of the partial sequence of the first base sequence data" of the present invention.
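The Python sketch below shows how these four items might be read from the body of a SAM file. In the SAM format, header lines begin with "@" and each alignment line has tab-separated columns in the order QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL. Mapping D221 to RNAME, D222 to POS, D223 to PNEXT, and D224 to SEQ is an assumption made for this example, as are the function name `parse_sam_body` and the sample lines.

```python
def parse_sam_body(lines):
    """Extract reference name, start position, mate start position, and sequence."""
    records = []
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        records.append({
            "rname": fields[2],       # reference sequence name (assumed D221)
            "pos": int(fields[3]),    # 1-based leftmost start position (assumed D222)
            "pnext": int(fields[7]),  # start position of the mate read (assumed D223)
            "seq": fields[9],         # base sequence of the read (assumed D224)
        })
    return records


# Hypothetical SAM content: one header line and two alignment lines.
sam_lines = [
    "@SQ\tSN:chr1\tLN:248956422",
    "read1\t0\tchr1\t10468\t42\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t0\tchr2\t733\t42\t5M\t*\t0\t0\tTTGCA\tIIIII",
]
records = parse_sam_body(sam_lines)
```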

並替部２１２は、リファレンス配列ごとに、各第２塩基配列データに対応する第１塩基配列データにおける部分配列の開始位置を示す数値を抽出する（図３／ＳＴＥＰ００８）。 The sorting unit 212 extracts a numerical value indicating the start position of the partial sequence in the first base sequence data corresponding to each second base sequence data for each reference sequence (FIG. 3 / STEP008).

Through the process of FIG. 3 / STEP008, the sorting unit 212 creates, for example, the position-extracted data D3 shown in FIG. 5A. The position-extracted data D3 includes the base sequence length D31 of each piece of second base sequence data, the name D32 of each reference sequence, the number D33 of pieces of second base sequence data associated with each reference sequence, and the start position D34 of the partial sequence of the first base sequence data corresponding to each piece of second base sequence data. The base sequence length D31 may be recognized from the length of each piece of second base sequence data. If the length of the second base sequence data is predetermined, the base sequence length D31 may be omitted.

In the position-extracted data D3 shown in FIG. 5A, the fourth and subsequent rows hold the start positions of the partial sequences of the first base sequence data corresponding to the respective second base sequence data.

図５Ａに示される位置抽出後データＤ３においては、３行目以降は、カンマ区切りで、２行目のリファレンス配列の名称Ｄ３２のそれぞれに対応するデータが格納されている。 In the position-extracted data D3 shown in FIG. 5A, the data corresponding to each of the names D32 of the reference sequence in the second row are stored in the third and subsequent rows separated by commas.

例えば、３行目の最初の「７１９７８６」は、リファレンス配列「ｃｈｒ１」に対応付けられた第２塩基配列データの数を示す。 For example, the first "719786" in the third row indicates the number of second base sequence data associated with the reference sequence "chr1".

また、３行目の二番目の「３８０９１２」は、リファレンス配列「ｃｈｒ２」に対応付けられた第２塩基配列データの数を示す。 Further, the second "380912" in the third row indicates the number of the second base sequence data associated with the reference sequence "chr2".

Further, the first value "177644860" in the fourth row is a numerical value indicating the start position of the partial sequence of the first base sequence data corresponding to one piece of the second base sequence data associated with the reference sequence "chr1".

Further, the first value "177644896" in the fifth row is a numerical value indicating the start position of the partial sequence of the first base sequence data corresponding to another piece of the second base sequence data associated with the reference sequence "chr1".

対応する開始位置がない場合は、空欄となる。 If there is no corresponding start position, it will be blank.

Instead of the start position of the partial sequence of the first base sequence data corresponding to each piece of second base sequence data, the sorting unit 212 may extract, for each associated reference sequence, the numerical value indicating the start position of the corresponding partial sequence of the first base sequence data in the paired-end case.

The sorting unit 212 sorts the numerical values indicating the start positions in ascending order for each reference sequence (FIG. 3 / STEP010).

After the process of FIG. 3 / STEP010, the sorting unit 212 creates the rearranged data D4 shown in FIG. 5B. The rearranged data D4 includes the base sequence length D41 of each piece of second base sequence data, the name D42 of each reference sequence, the number D43 of pieces of second base sequence data associated with each reference sequence, and the start positions D44, D45, and D46 of the partial sequences of the first base sequence data corresponding to the respective second base sequence data. The start positions D44, D45, and D46 are sorted in ascending order. Consequently, the data D44 in the topmost of these rows (the fourth row of the rearranged data D4) holds the smallest element (start position) of each reference sequence.
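The extraction in STEP008 and the sort in STEP010 can be sketched roughly as follows. This is an illustration only, not the implementation described in this specification: the function name `sorted_start_positions` is hypothetical, and the sketch assumes tab-separated SAM records in which the third field (RNAME) holds the reference sequence name and the fourth field (POS) holds the 1-based start position, with `*` or 0 marking a read that has no corresponding start position.

```python
from collections import defaultdict


def sorted_start_positions(sam_lines):
    """Group start positions by reference sequence and sort each group ascending.

    Hypothetical helper: assumes standard SAM column order (RNAME = field 3,
    POS = field 4).
    """
    positions = defaultdict(list)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        rname, pos = fields[2], int(fields[3])
        if rname == "*" or pos == 0:
            continue  # unmapped read: no corresponding start position
        positions[rname].append(pos)
    # Ascending sort per reference sequence, as in STEP010.
    return {name: sorted(values) for name, values in positions.items()}
```

Sorting per reference sequence is what makes neighbouring start positions close to each other, which the later difference step relies on.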

The reference element recognition unit 213 recognizes one or more reference elements for each reference sequence (FIG. 3 / STEP012). The reference element is, for example, the smallest element of each reference sequence. Any element other than the smallest may instead be recognized as the reference element. A plurality of elements may also be recognized as reference elements for a single reference sequence.

For the elements other than the reference element, the difference recognition unit 214 recognizes the sequence of differences between adjacent elements (FIG. 3 / STEP014). After the process of FIG. 3 / STEP014, the difference recognition unit 214 creates, for example, the difference recognition data D5 shown in FIG. 5C.

The difference recognition data D5 includes the base sequence length D51 of each piece of second base sequence data, the name D52 of each reference sequence, the number D53 of pieces of second base sequence data associated with each reference sequence, the reference element D54 of each reference sequence, and the difference data D55 and D56.

例えば、図５Ｂに示される並替後データＤ４では、リファレンス配列ｃｈｒ１に含まれる部分配列の開始位置は、小さい順に、９９９７、９９９８、９９９８・・・である。 For example, in the rearranged data D4 shown in FIG. 5B, the start positions of the partial sequences included in the reference sequence chr1 are 9997, 9998, 9998, ... In ascending order.

図５Ｃに示される差分認識後データＤ５の第４行目（符号Ｄ５４で示される行）には、リファレンス配列ｃｈｒ１における図３／ＳＴＥＰ０１２で認識された基準要素９９９７が含まれている。 The fourth row (row represented by reference numeral D54) of the difference recognition data D5 shown in FIG. 5C contains the reference element 9997 recognized in FIG. 3 / STEP012 in the reference sequence chr1.

Further, in the fifth row (the row denoted by D55) of the difference recognition data D5 shown in FIG. 5C, the element corresponding to the reference sequence chr1 is 1, which is the difference between the element 9998 in the fifth row of FIG. 5B (the row denoted by D45) and the preceding element 9997 in the fourth row (the row denoted by D44).

Further, in the sixth row (the row denoted by D56) of the difference recognition data D5 shown in FIG. 5C, the element corresponding to the reference sequence chr1 is 0, which is the difference between the element 9998 in the sixth row of FIG. 5B (the row denoted by D46) and the preceding element 9998 in the fifth row (the row denoted by D45).

また、例えば、図５Ｂに示される並替後データＤ４では、リファレンス配列ｃｈｒ２に含まれる部分配列の開始位置は、小さい順に、１０２３７、１０２８６、１０３３０・・・である。 Further, for example, in the rearranged data D4 shown in FIG. 5B, the start positions of the partial sequences included in the reference sequence chr2 are 10237, 10286, 10330, ... In ascending order.

図５Ｃに示される差分認識後データＤ５の第４行目（符号Ｄ５４で示される行）には、リファレンス配列ｃｈｒ２における図３／ＳＴＥＰ０１２で認識された基準要素１０２３７が含まれている。 The fourth row (row represented by reference numeral D54) of the difference recognition data D5 shown in FIG. 5C contains the reference element 10237 recognized in FIG. 3 / STEP012 in the reference sequence chr2.

Further, in the fifth row (the row denoted by D55) of the difference recognition data D5 shown in FIG. 5C, the element corresponding to the reference sequence chr2 is 49, which is the difference between the element 10286 in the fifth row of FIG. 5B (the row denoted by D45) and the preceding element 10237 in the fourth row (the row denoted by D44).

Further, in the sixth row (the row denoted by D56) of the difference recognition data D5 shown in FIG. 5C, the element corresponding to the reference sequence chr2 is 44, which is the difference between the element 10330 in the sixth row of FIG. 5B (the row denoted by D46) and the preceding element 10286 in the fifth row (the row denoted by D45).
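The arithmetic in the chr1 and chr2 examples above can be reproduced with a few lines of Python. The sketch below assumes the smallest element is used as the reference element; the function name `to_base_and_deltas` is hypothetical and does not appear in this specification.

```python
def to_base_and_deltas(sorted_positions):
    """Return (reference element, differences between adjacent elements).

    Assumes the input is already sorted ascending, so the first element is
    the smallest and every difference is non-negative.
    """
    if not sorted_positions:
        raise ValueError("at least one position is required")
    base = sorted_positions[0]
    deltas = [b - a for a, b in zip(sorted_positions, sorted_positions[1:])]
    return base, deltas


print(to_base_and_deltas([9997, 9998, 9998]))     # (9997, [1, 0])
print(to_base_and_deltas([10237, 10286, 10330]))  # (10237, [49, 44])
```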

The data creation unit 215 creates data including the reference element recognized in FIG. 3 / STEP012 and the differences between adjacent elements recognized in FIG. 3 / STEP014 (FIG. 3 / STEP016).

For example, based on the difference recognition data D5 shown in FIG. 5C, the data creation unit 215 creates, in FIG. 3 / STEP016, data D6 as shown in FIG. 6A for each reference sequence. The data D6 includes the base sequence length D61 of the second base sequence data, the number D62 contained in the name of the reference sequence, the number D63 of pieces of second base sequence data associated with the reference sequence, the reference element D64, and the sequence of differences D65 and D66.

図３／ＳＴＥＰ０１６で作成されるデータは、少なくとも差分を示すデータ部分に関しては、図７Ａに示されるように、第１部分Ｄ１と、第２部分Ｄ２とを含む形式のデータとなっている。 The data created in FIG. 3 / STEP016 is in a format including the first portion D1 and the second portion D2, as shown in FIG. 7A, at least for the data portion showing the difference.

この第２部分Ｄ２は、何ビットでもよいが、１４ビット以下であることが好ましいが、６ビット以下であることがより好ましく、３ビット以下であることがさらに好ましい。 The second portion D2 may have any number of bits, but is preferably 14 bits or less, more preferably 6 bits or less, and further preferably 3 bits or less.

第１部分Ｄ１は、先行又は後続するデータが関連するデータであるか否かを示す部分である。第２部分Ｄ２は、差分等の対象のデータの内容を示す部分である。第１部分Ｄ１は、例えば、１ビットで構成されていてもよい。 The first part D1 is a part indicating whether or not the preceding or succeeding data is related data. The second part D2 is a part showing the contents of the target data such as the difference. The first portion D1 may be composed of, for example, one bit.

When the first portion D1 consists of 1 bit, for example, a first portion of 0 may mean that the subsequent data of the predetermined length is not related, and a first portion of 1 may mean that the subsequent data of the predetermined length is related. Any rule may be used, however, as long as the range to be read can be identified from the first portion.

例えば、図７Ｂに示されるデータは、第１部分が１ビットで、第２部分が３ビットで構成された場合の例を示している。図７Ｂに示されるデータは、第１部分が０の場合、後続する所定の長さのデータが関連しないことを意味し、第１部分が１の場合、後続する所定の長さのデータが関連することを意味する。 For example, the data shown in FIG. 7B shows an example in which the first part is composed of 1 bit and the second part is composed of 3 bits. The data shown in FIG. 7B means that when the first part is 0, the subsequent data of a predetermined length is not related, and when the first part is 1, the subsequent data of a predetermined length is related. Means to do.

第２部分が３ビットである場合、１０進数の１～７については、３ビットで十分に表現できるため後続するデータを使用する必要はない。このため、１０進数の１、３について、図７Ｂに示されるように、第１部分は０となる。また、１０進数の１、３については、図７Ｂに示されるように、第２部分は、それぞれ００１、０１１となる。 When the second part is 3 bits, it is not necessary to use the following data because the decimal numbers 1 to 7 can be sufficiently expressed by 3 bits. Therefore, for decimal numbers 1 and 3, the first part is 0 as shown in FIG. 7B. Further, for decimal numbers 1 and 3, as shown in FIG. 7B, the second part is 001 and 011, respectively.

On the other hand, the decimal numbers 8 to 31 cannot be expressed in 3 bits. For these values, as shown in FIG. 7B, the first portion of the first data unit is therefore 1. Since 6 bits are sufficient to express the decimal numbers 8 to 31, the first portion of the next data unit is 0, as shown in FIG. 7B. For such values, the content of the target data, such as a difference, is given by all of the related second portions taken together. For example, as shown in FIG. 7B, the value 8 is expressed in binary as 001000, which is the concatenation of 001 in the first second portion and 000 in the next second portion.
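The FIG. 7B rule (a 1-bit first portion followed by a 3-bit second portion) can be sketched as an encoder and decoder over bit strings. This is a minimal illustration under stated assumptions, not the format as actually implemented: the names `encode_value` and `decode_stream` are hypothetical, bits are modelled as a text string of `0` and `1` for readability, and the related second portions are assumed to be concatenated most significant group first, as in the example of 8 given above.

```python
def encode_value(value, payload_bits=3):
    """Encode a non-negative integer as units of 1 flag bit + payload_bits bits."""
    if value < 0:
        raise ValueError("value must be non-negative")
    mask = (1 << payload_bits) - 1
    groups = []
    while True:
        groups.append(value & mask)  # collect low-order groups first
        value >>= payload_bits
        if value == 0:
            break
    groups.reverse()  # most significant group first
    units = []
    for index, group in enumerate(groups):
        # Flag 1: the following unit is related; flag 0: this unit is the last.
        flag = "1" if index < len(groups) - 1 else "0"
        units.append(flag + format(group, f"0{payload_bits}b"))
    return "".join(units)


def decode_stream(bits, payload_bits=3):
    """Decode a concatenation of encoded values back into a list of integers."""
    step = payload_bits + 1
    if len(bits) % step:
        raise ValueError("bit string length is not a multiple of the unit size")
    values, current = [], 0
    for start in range(0, len(bits), step):
        unit = bits[start:start + step]
        current = (current << payload_bits) | int(unit[1:], 2)
        if unit[0] == "0":  # last unit of this value
            values.append(current)
            current = 0
    return values


print(encode_value(1), encode_value(3), encode_value(8))  # 0001 0011 10010000
print(decode_stream("0001" + "0011" + "10010000"))        # [1, 3, 8]
```

A real implementation would pack these units into bytes rather than characters; the string form is used here only to keep the bit layout visible.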

第２部分の大きさは、対象のデータのサイズ解析することで、最適化しうる。 The size of the second part can be optimized by analyzing the size of the target data.
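One way such a size analysis might be done is to compute, for each candidate width of the second portion, the total number of bits needed to encode all differences, and pick the smallest. The sketch below is an assumption about how this could work, not a procedure stated in this specification; `total_bits` and `best_payload_width` are hypothetical names, and a 1-bit first portion is assumed.

```python
def total_bits(deltas, payload_bits):
    """Total encoded size in bits with a 1-bit first portion per unit."""
    unit_size = payload_bits + 1
    total = 0
    for delta in deltas:
        needed = max(delta.bit_length(), 1)   # the value 0 still needs one unit
        units = -(-needed // payload_bits)    # ceiling division
        total += units * unit_size
    return total


def best_payload_width(deltas, candidates=range(1, 15)):
    """Return the candidate width with the smallest total (smallest width on ties)."""
    return min(candidates, key=lambda width: total_bits(deltas, width))


print(total_bits([1, 0, 49, 44], 3))      # 24
print(best_payload_width([1, 0, 49, 44])) # 2
```

For this tiny input the widths 2 and 3 both give 24 bits, and `min` returns the first, so the result is 2; on real data the distribution of differences decides the outcome.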

図６Ｂは、このような第１部分と第２部分とを含むデータ形式で図６Ａに示されるデータを作成した時の例を示す図である。 FIG. 6B is a diagram showing an example when the data shown in FIG. 6A is created in a data format including such a first portion and a second portion.

図６Ｂにおいては、当該リファレンス配列の名称に含まれる番号と、当該リファレンス配列に対応付けられた第２塩基配列データの総数と、基準要素と、各差分とが上記したデータ形式で表現されている。図３／ＳＴＥＰ０１６で作成されるデータには、図６Ｂに示されるデータが、リファレンス配列の数だけ繰り返し含まれている。当該リファレンス配列に対応付けられた第２塩基配列データの総数は、このリファレンス配列ごとの区切りを示すために用いられる。 In FIG. 6B, the number included in the name of the reference sequence, the total number of the second base sequence data associated with the reference sequence, the reference element, and each difference are represented in the above-mentioned data format. .. The data created in FIG. 3 / STEP016 contains the data shown in FIG. 6B repeatedly as many as the number of reference sequences. The total number of the second base sequence data associated with the reference sequence is used to indicate the delimiter for each reference sequence.

データ作成部２１５は、作成したデータをバイナリ形式でデータ記憶部２２２に記憶するとともに、データベース３００に送信する。データベース３００は、データ作成装置２００又は対象の生体Ｐを特定できる情報（例えばユーザＩＤなど）とともに受信したデータを記憶する。データ作成部２１５は、データベース３００に、リファレンス配列と基準要素とを除外したデータを送信してもよい。このようにすることで、データベース３００に記憶されたデータからは、全てのデータが復元できなくなるので、個人情報の保護が図られうる。 The data creation unit 215 stores the created data in the data storage unit 222 in a binary format and transmits the created data to the database 300. The database 300 stores the received data together with the information that can identify the data creation device 200 or the target living body P (for example, a user ID). The data creation unit 215 may send the data excluding the reference sequence and the reference element to the database 300. By doing so, all the data cannot be restored from the data stored in the database 300, so that personal information can be protected.

以上により、データ作成処理が終了する。 This completes the data creation process.

(Data restoration)
A method of restoring data from the data created in FIG. 3 / STEP016 will be described. The following processing can be performed by a general computer having access to the first base sequence data.

まず、第１ステップにおいて、コンピュータは、図３／ＳＴＥＰ０１６で作成されるデータを先頭から読み込み、各第２塩基配列データの塩基配列の長さと、一のリファレンス配列の名称に含まれる番号と、当該リファレンス配列に対応付けられた第２塩基配列データの総数とを認識する。 First, in the first step, the computer reads the data created in FIG. 3 / STEP016 from the beginning, the length of the base sequence of each second base sequence data, the number included in the name of one reference sequence, and the said. Recognize the total number of second base sequence data associated with the reference sequence.

次に、第２ステップにおいて、コンピュータは、基準要素を認識する。 Next, in the second step, the computer recognizes the reference element.

第３ステップにおいて、コンピュータは、一のリファレンス配列の名称に含まれる番号と、基準要素とから、基準要素に対応する第１塩基配列データの部分配列の開始位置を認識できる。コンピュータは、当該部分配列の開始位置と各第２塩基配列データの塩基配列の長さとに基づいて、基準要素に対応する第１塩基配列データの部分配列を認識することができる。また、コンピュータは、当該リファレンス配列に対応付けられた第２塩基配列データの総数から１を引く。 In the third step, the computer can recognize the start position of the partial sequence of the first base sequence data corresponding to the reference element from the number included in the name of one reference sequence and the reference element. The computer can recognize the partial sequence of the first base sequence data corresponding to the reference element based on the start position of the partial sequence and the length of the base sequence of each second base sequence data. Further, the computer subtracts 1 from the total number of the second base sequence data associated with the reference sequence.

第４ステップにおいて、コンピュータは、基準要素の次の差分を読み込む。コンピュータは、基準要素に当該差分を加えることで、２番目の要素の値を認識する。コンピュータは、この値に基づき、２番目の要素に対応する第１塩基配列データの部分配列の開始位置を認識できる。コンピュータは、当該部分配列の開始位置と各第２塩基配列データの塩基配列の長さとに基づいて、２番目の要素に対応する第１塩基配列データの部分配列を認識することができる。また、コンピュータは、当該リファレンス配列に対応付けられた第２塩基配列データの総数から１を引く。 In the fourth step, the computer reads the next difference of the reference element. The computer recognizes the value of the second element by adding the difference to the reference element. Based on this value, the computer can recognize the start position of the partial sequence of the first base sequence data corresponding to the second element. The computer can recognize the partial sequence of the first base sequence data corresponding to the second element based on the start position of the partial sequence and the length of the base sequence of each second base sequence data. Further, the computer subtracts 1 from the total number of the second base sequence data associated with the reference sequence.

第５ステップにおいて、コンピュータは、その次の差分を読み込む。コンピュータは、２番目の要素の値に当該差分を加えることで、３番目の要素の値を認識する。コンピュータは、この値に基づき、３番目の要素に対応する第１塩基配列データの部分配列の開始位置を認識できる。コンピュータは、当該部分配列の開始位置と各第２塩基配列データの塩基配列の長さとに基づいて、３番目の要素に対応する第１塩基配列データの部分配列を認識することができる。また、コンピュータは、当該リファレンス配列に対応付けられた第２塩基配列データの総数から１を引く。 In the fifth step, the computer reads the next difference. The computer recognizes the value of the third element by adding the difference to the value of the second element. Based on this value, the computer can recognize the start position of the partial sequence of the first base sequence data corresponding to the third element. The computer can recognize the partial sequence of the first base sequence data corresponding to the third element based on the start position of the partial sequence and the length of the base sequence of each second base sequence data. Further, the computer subtracts 1 from the total number of the second base sequence data associated with the reference sequence.

当該リファレンス配列に対応付けられた第２塩基配列データの総数がゼロになるまで、コンピュータは、第５ステップを繰り返す。第２塩基配列データの総数がゼロとなった場合、データの読み込みが完了するまで、コンピュータは、第１ステップ～第５ステップを繰り返し実行する。 The computer repeats the fifth step until the total number of the second base sequence data associated with the reference sequence becomes zero. When the total number of the second base sequence data becomes zero, the computer repeatedly executes the first step to the fifth step until the reading of the data is completed.
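For a single reference sequence, the second to fifth steps amount to a running sum over the differences followed by a slice of the reference. The following is a minimal sketch under assumptions that this specification does not fix: start positions are taken to be 1-based, the reference sequence is held as an in-memory string, and the names `restore_positions` and `restore_partial_sequences` are hypothetical.

```python
from itertools import accumulate


def restore_positions(base, deltas):
    """Rebuild the sorted start positions from the reference element and differences."""
    # accumulate(..., initial=...) requires Python 3.8 or later.
    return list(accumulate(deltas, initial=base))


def restore_partial_sequences(reference, base, deltas, read_length):
    """Cut the partial sequence of length read_length at every restored position.

    Assumption: positions are 1-based, so position p maps to index p - 1.
    """
    return [
        reference[position - 1:position - 1 + read_length]
        for position in restore_positions(base, deltas)
    ]


print(restore_positions(9997, [1, 0]))  # [9997, 9998, 9998]
print(restore_partial_sequences("ACGTACGTAC", 2, [1, 0, 3], 4))
# ['CGTA', 'GTAC', 'GTAC', 'CGTA']
```

The restored list has one entry per piece of second base sequence data, so its length plays the role of the counter that the steps above decrement to zero.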

このようにすることで、コンピュータは、各第２塩基配列データに対応する第１塩基配列データの部分配列の群を認識することができる。この各第２塩基配列データに対応する第１塩基配列データの部分配列の群は、各第２塩基配列データとは完全には一致しないが、生体Ｐのゲノムシーケンサーによる読取頻度の解析をする上では十分に有用である。 By doing so, the computer can recognize a group of partial sequences of the first base sequence data corresponding to each second base sequence data. The group of partial sequences of the first base sequence data corresponding to each of the second base sequence data does not completely match each second base sequence data, but it is necessary to analyze the reading frequency by the genome sequencer of the living body P. Is useful enough.

(Operation and effect of this embodiment)
According to the data creation device 200 configured as described above, the position recognition unit 211 recognizes, based on the first base sequence data stored in the first base sequence storage unit 221, for each piece of second base sequence data whose individual length is a second number of bases shorter than the first number of bases, the positions D221 and D222 of the partial sequence in the first base sequence data corresponding to that second base sequence data D224 (FIG. 3 / STEP004, FIG. 3 / STEP006).

そして、並替部２１２により、各第２塩基配列データに対応する第１塩基配列データの部分配列の位置を昇順または降順で並び替えることにより（図３／ＳＴＥＰ０１０）、各第２塩基配列データに対応する第１塩基配列データの部分配列の位置の配列（図５Ｂの第４行目以降）が作成される（図５Ｂ参照）。ここで、各第２塩基配列データに対応する第１塩基配列データの部分配列の位置の配列の隣り合う要素は、互いに近い位置となるので、その差分はかなり小さくなりやすい。特に、高頻度に読み取られた塩基配列に関連する塩基配列データについては、それらの位置は同一またはほとんど差がないものとなる。 Then, by rearranging the positions of the partial sequences of the first base sequence data corresponding to each second base sequence data in ascending or descending order by the rearrangement unit 212 (FIG. 3 / STEP010), each second base sequence data can be obtained. A sequence of the positions of the partial sequences of the corresponding first base sequence data (from the fourth row in FIG. 5B) is created (see FIG. 5B). Here, since the adjacent elements of the sequence at the position of the partial sequence of the first base sequence data corresponding to each second base sequence data are located close to each other, the difference tends to be considerably small. In particular, for the base sequence data related to the base sequence read frequently, their positions are the same or almost the same.

そして、基準要素認識部２１３により、前記位置の配列に含まれる少なくとも一つの位置である基準要素が認識される（図３／ＳＴＥＰ０１２）。 Then, the reference element recognition unit 213 recognizes the reference element which is at least one position included in the array of the positions (FIG. 3 / STEP012).

そして、差分認識部２１４により、位置の配列の隣り合う要素間の差分の配列が認識される（図３／ＳＴＥＰ０１４）。 Then, the difference recognition unit 214 recognizes the array of differences between adjacent elements of the array of positions (FIG. 3 / STEP014).

そして、データ作成部２１５により、基準要素認識部２１３により認識された基準要素と差分認識部２１４により認識された要素間の差分とを含むデータＤ６が作成される（図３／ＳＴＥＰ０１６）。 Then, the data creation unit 215 creates data D6 including the reference element recognized by the reference element recognition unit 213 and the difference between the elements recognized by the difference recognition unit 214 (FIG. 3 / STEP016).

要素間の差分は、前述したように高頻度で読み取られた部分についてはかなり小さくなりやすいので、要素間の差分を示すデータのデータ容量は、小さく抑えられうる。 Since the difference between the elements tends to be considerably small for the portion read frequently as described above, the data capacity of the data indicating the difference between the elements can be kept small.

例えば、本発明者らが実験したところによると、図３／ＳＴＥＰ０１６で作成されたデータのサイズは、図３／ＳＴＥＰ００６で作成されたＳＡＭファイルのサイズの約０．３３％となった。また、図３／ＳＴＥＰ０１６で作成されたデータのサイズは、開始位置を示す数値を抜き出した図５Ａのファイルのサイズと比較しても、約４．９７％となった。 For example, according to the experiments conducted by the present inventors, the size of the data created in FIG. 3 / STEP016 was about 0.33% of the size of the SAM file created in FIG.3 / STEP006. Further, the size of the data created in FIG. 3 / STEP016 was about 4.97% even when compared with the size of the file in FIG. 5A from which the numerical value indicating the start position was extracted.

一方、作成されたデータに含まれる基準要素と要素間の差分とを用いれば、各第２塩基配列データに対応する第１塩基配列データの部分配列の位置を逆算して求めることができる。このような各第２塩基配列データに対応する第１塩基配列データの部分配列の位置を示す数値は、対象のゲノムデータの内のどの部分の塩基配列がどの程度の頻度で読み取られているものかを示すこととなる。 On the other hand, if the reference element included in the created data and the difference between the elements are used, the position of the partial sequence of the first base sequence data corresponding to each second base sequence data can be calculated back. The numerical value indicating the position of the partial sequence of the first base sequence data corresponding to each of the second base sequence data is such that the base sequence of which part of the target genomic data is read at what frequency. Will be shown.

以上の通り、本発明のデータ作成装置２００によれば、データ容量を抑えながら、ゲノムシーケンサー１００による読み取り頻度を再現できるデータを作成しうる。 As described above, according to the data creation device 200 of the present invention, it is possible to create data that can reproduce the reading frequency by the genome sequencer 100 while suppressing the data capacity.

また、当該構成のデータ作成装置２００によれば、基準要素認識部２１３により、位置の配列に含まれる要素のうち最小の値の要素が基準要素として認識される（図３／ＳＴＥＰ０１２）。これにより、基準要素を示すデータのデータ容量を小さく抑えることができるので、より圧縮率を向上させうる。 Further, according to the data creation device 200 having the above configuration, the reference element recognition unit 213 recognizes the element having the smallest value among the elements included in the position array as the reference element (FIG. 3 / STEP012). As a result, the data capacity of the data indicating the reference element can be suppressed to a small size, so that the compression rate can be further improved.

当該構成のデータ作成装置２００によれば、データ作成部２１５により、要素間の差分を示すデータとして、先行又は後続のデータが関連するデータであるか否かを示す第１部分Ｄ６１と１４ビット以下のデータを格納する第２部分Ｄ６２とを一又は複数含む可変長データＤ６（図７Ａ参照）が作成される（図３／ＳＴＥＰ０１６）。 According to the data creation device 200 having the configuration, the data creation unit 215 uses the first portion D61 indicating whether or not the preceding or succeeding data is related data as the data indicating the difference between the elements, and 14 bits or less. Variable length data D6 (see FIG. 7A) including one or more of the second portion D62 for storing the data of the above is created (FIG. 3 / STEP016).

本願の出願人が検討したところによれば、位置の配列の連続する要素間の各差分は、ほとんど１４ビット以下で表すことができる。 According to the applicants of the present application, each difference between consecutive elements of the array of positions can be represented by almost 14 bits or less.

For example, FIG. 8 is a graph created from data obtained from a certain living body (a human), showing how many bits are needed to represent each difference. The horizontal axis of FIG. 8 indicates the number of bits needed to represent a difference. The left axis indicates the frequency of occurrence of each bit count. The right axis indicates the cumulative share of those frequencies. As shown in FIG. 8, the cumulative share reaches almost 100% at 14 bits. For this reason, the second portion is preferably 14 bits or less.

これにより、多くのデータについて、そのデータ容量を小さく抑えながら、各要素の差分を表現することが可能となる。 This makes it possible to express the difference between each element of a large amount of data while keeping the data capacity small.

また、先行又は後続するデータが関連データであるか否かを示す第１部分により適当な数の第２部分が前記要素間の差分を示すデータとして用いられることが示されることで、一の第２部分のビット数以上となる差分についても、図７Ｂに示されるように、当該可変長データで表現することができる。 Further, the first part indicating whether the preceding or succeeding data is related data indicates that an appropriate number of second parts are used as data indicating the difference between the elements. As shown in FIG. 7B, a difference having a number of bits or more in two parts can also be expressed by the variable length data.

また、図８に示されるように、各ビットの出現頻度の割合を累計した割合は、６ビットでほぼ８０％となる。このため、第２部分は、６ビット以下であってもよい。 Further, as shown in FIG. 8, the cumulative ratio of the appearance frequency of each bit is about 80% for 6 bits. Therefore, the second part may be 6 bits or less.

また、図８に示されるように、各ビットの出現頻度の割合を累計した割合は、３ビットでほぼ６０％となる。このため、第２部分は、３ビット以下であってもよい。 Further, as shown in FIG. 8, the cumulative ratio of the appearance frequency of each bit is approximately 60% for 3 bits. Therefore, the second part may be 3 bits or less.
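A cumulative distribution of the kind plotted in FIG. 8 can be computed from a list of differences as sketched below. The function name `cumulative_bit_length_share` is hypothetical, the input values in the usage lines are made-up illustration data rather than measurements, and a difference of 0 is assumed to count as needing 1 bit.

```python
from collections import Counter


def cumulative_bit_length_share(deltas):
    """Map each bit count to the share of differences representable in that many bits or fewer."""
    counts = Counter(max(delta.bit_length(), 1) for delta in deltas)
    total = sum(counts.values())
    shares = {}
    running = 0
    for bits in sorted(counts):
        running += counts[bits]
        shares[bits] = running / total
    return shares


# Made-up differences, for illustration only.
shares = cumulative_bit_length_share([0, 1, 1, 3, 8, 49])
print(shares[1], shares[6])  # 0.5 1.0
```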

(Modifications)
In the above-described embodiment, the first base sequence data is divided into a plurality of reference sequences, but the present invention is not limited to this, and the first base sequence data may be represented by a single sequence.

第１部分は、２ビットであってもよい。このデータ形式においては、例えば、第１部分が００である場合、第２部分が２ビットであることを示し、第１部分が０１である場合、第２部分が６ビットであることを示し、第１部分が１０である場合、第２部分が１０ビットであることを示し、第１部分が１１である場合、第２部分が１０ビットであるとともに、後続するデータが関連するデータであることを示してもよい。 The first part may be 2 bits. In this data format, for example, when the first part is 00, it indicates that the second part is 2 bits, and when the first part is 01, it indicates that the second part is 6 bits. When the first part is 10, the second part is 10 bits, and when the first part is 11, the second part is 10 bits and the subsequent data is related data. May be shown.
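The 2-bit first portion described above can be sketched as follows. The text does not say how chained units combine into one value, so this sketch makes two explicit assumptions: chained 10-bit second portions are concatenated most significant first, and a value that needs more than 10 bits always ends with a unit tagged `10`. The names `WIDTHS`, `encode_value`, and `decode_stream` are hypothetical.

```python
# First portion -> width of the second portion in bits.
# "11" additionally means that the following unit belongs to the same value.
WIDTHS = {"00": 2, "01": 6, "10": 10, "11": 10}


def encode_value(value):
    """Encode a non-negative integer with a 2-bit first portion."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 1 << 2:
        return "00" + format(value, "02b")
    if value < 1 << 6:
        return "01" + format(value, "06b")
    chunks = []
    while True:
        chunks.append(value & 0x3FF)  # 10-bit groups, low-order first
        value >>= 10
        if value == 0:
            break
    chunks.reverse()
    units = ["11" + format(chunk, "010b") for chunk in chunks[:-1]]
    units.append("10" + format(chunks[-1], "010b"))  # assumed terminator unit
    return "".join(units)


def decode_stream(bits):
    """Decode a concatenation of encoded values back into a list of integers."""
    values, current, index = [], 0, 0
    while index < len(bits):
        prefix = bits[index:index + 2]
        width = WIDTHS[prefix]
        payload = int(bits[index + 2:index + 2 + width], 2)
        current = (current << width) | payload
        index += 2 + width
        if prefix != "11":  # no further related unit
            values.append(current)
            current = 0
    return values


print(encode_value(3), encode_value(49))  # 0011 01110001
```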

また、第２部分は、関連するデータの数に応じてその長さが可変であってもよい。例えば、関連するデータの数が１である場合、第２部分が１ビットであり、関連するデータの数が２以上である場合、第２部分がそれぞれ３ビットとなるようなデータ形式であってもよい。 Further, the length of the second part may be variable depending on the number of related data. For example, if the number of related data is 1, the second part is 1 bit, and if the number of related data is 2 or more, the second part is 3 bits each. May be good.

データ作成部２１５は、このようなデータ形式に応じて、図３／ＳＴＥＰ０１６におけるデータを作成してもよい。 The data creation unit 215 may create the data in FIG. 3 / STEP016 according to such a data format.

100: genome sequencer, 200: data creation device, 210: arithmetic processing unit, 211: position recognition unit, 212: sorting unit, 213: reference element recognition unit, 214: difference recognition unit, 215: data creation unit, 220: storage unit, 221: first base sequence storage unit, 222: data storage unit, 300: database.

Claims

A data creation device comprising:
a first base sequence storage unit that stores first base sequence data having a length of a first number of bases;
a position recognition unit that, based on the first base sequence data stored in the first base sequence storage unit, recognizes, for each piece of second base sequence data whose individual length is shorter than the first number of bases, a numerical value indicating the position of the partial sequence in the first base sequence data corresponding to that second base sequence data;
a sorting unit that creates a sequence of positions by sorting, in ascending or descending order, the numerical values indicating the positions of the partial sequences of the first base sequence data corresponding to the respective second base sequence data;
a reference element recognition unit that recognizes a reference element, which is at least one element included in the sequence of positions;
a difference recognition unit that recognizes, for each of the second and subsequent elements in the order of the sequence of positions, the difference between that element and the element immediately preceding it in that order as a difference between adjacent elements; and
a data creation unit that creates data including the reference element recognized by the reference element recognition unit and the differences between the elements recognized by the difference recognition unit.

The data creation device according to claim 1,
wherein the reference element recognition unit is configured to recognize, as the reference element, the element having the smallest value among the elements included in the position array.

The data creation device according to claim 1 or 2,
wherein the data creation unit creates, as data indicating the difference between the elements, variable-length data including one or a plurality of data items, each having a first part indicating whether or not the preceding or succeeding data item is related data and a second part storing data of 14 bits or less.

The data creation device according to claim 3,
wherein the second part has 6 bits or less.

The data creation device according to claim 4,
wherein the second part has 3 bits or less.

A data creation method executed by a computer having a first base sequence storage unit that stores first base sequence data having a length of a first number of bases, the method comprising:
a step of recognizing, based on the first base sequence data stored in the first base sequence storage unit and for each second base sequence data having an individual length shorter than the first number of bases, a numerical value indicating the position of the partial sequence in the first base sequence data that corresponds to that second base sequence data;
a step of creating a position array by rearranging, in ascending or descending order, the numerical values indicating the positions of the partial sequences of the first base sequence data corresponding to each second base sequence data;
a step of recognizing a reference element, which is at least one element included in the position array;
a step of recognizing, for each element from the second onward in the order of the position array, the difference between that element and the element immediately preceding it as a difference between adjacent elements; and
a step of creating data including the reference element and the differences between the elements.

A storage medium storing a data creation program that causes a computer provided with a first base sequence storage unit, which stores first base sequence data having a length of a first number of bases, to execute:
a step of recognizing, based on the first base sequence data stored in the first base sequence storage unit and for each second base sequence data having an individual length shorter than the first number of bases, a numerical value indicating the position of the partial sequence in the first base sequence data that corresponds to that second base sequence data;
a step of creating a position array by rearranging, in ascending or descending order, the numerical values indicating the positions of the partial sequences of the first base sequence data corresponding to each second base sequence data;
a step of recognizing a reference element, which is at least one element included in the position array;
a step of recognizing, for each element from the second onward in the order of the position array, the difference between that element and the element immediately preceding it as a difference between adjacent elements; and
a step of creating data including the reference element and the differences between the elements.