JP2013165661A

JP2013165661A - Method for determining base sequence of two or more specimens at a time by corresponding each specimen to sequence

Info

Publication number: JP2013165661A
Application number: JP2012030173A
Authority: JP
Inventors: Junya Yamagishi; 山岸潤也
Original assignee: Obihiro University of Agriculture and Veterinary Medicine NUC
Current assignee: Obihiro University of Agriculture and Veterinary Medicine NUC
Priority date: 2012-02-15
Filing date: 2012-02-15
Publication date: 2013-08-29

Abstract

PROBLEM TO BE SOLVED: To provide a method for determining the base sequences of two or more specimens in a single run while associating each specimen with its sequences. SOLUTION: Two or more nucleic acid specimens are mixed in specific combinations to prepare a plurality of groups, and an index sequence specific to each group is added. Sequence analysis is then carried out using a next-generation sequencer (massively parallel sequencer, NGS). Reads that share the same specimen-derived sequence are associated with a specimen based on the combination pattern of the index sequences attached to them, and the reads associated with the same specimen are assembled, so that the sequences are determined in a single run together with their correspondence to each specimen.

Description

The present invention relates to a method for determining base sequences in association with each specimen by: creating subgroups in which a plurality of nucleic acid specimens are mixed in specific combinations; adding an Index sequence unique to each subgroup; performing sequence analysis with a massively parallel sequencer; classifying the obtained sequences according to their Index sequences; and assembling the classified sequences.

Next-generation sequencers can output a large number of sequences at once through parallel processing, but because the specimens are mixed for the reaction, the output sequences cannot be associated with the specimens. For this reason, they are used for analyses of DNA or RNA derived from a single specimen, such as genome analysis and transcriptome analysis, and for analyses aimed at characterizing a population, such as metagenome analysis (FIG. 1).

Meanwhile, the multiplex method is known as a way to process multiple specimens and thereby make effective use of the analytical capacity of next-generation sequencers: an artificial base sequence (Index) of several to a dozen or so bases is attached during specimen processing so that the specimens can be associated with the output sequences. However, the number of specimens that can be processed in parallel is limited to several tens.

Cell Engineering (細胞工学), August 2011 issue, "Mastering the Next-Generation Sequencer: The Basics of the Basics"
Cell Engineering (細胞工学), August 2011 issue, "Mastering the Next-Generation Sequencer: De novo DNA sequencing of bacterial genomes by the multiplex method"
Parasitol Int. 2011 Jun; 60(2): 199-202. Epub 2011 Mar 21. Construction and analysis of full-length cDNA library of Cryptosporidium parvum.
Genome Biol. 2009; 10(3): R25. Epub 2009 Mar 4. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

An object of the present invention is to provide a method that, by improving the way Indexes are assigned, handles more specimens than the multiplex method and makes effective use of the analytical capacity of next-generation sequencers.

As a result of intensive study of Index assignment methods and of methods for analyzing the sequences output by next-generation sequencers, the present inventor found a method in which the samples are mixed in specific combinations, an Index is attached to each of the mixed samples (whose number is compressed to a number proportional to a root of the original number of samples), base sequences are obtained with a next-generation sequencer, and the pre-mixing sequences are restored from the output sequences in association with the specimen IDs. The present invention was thereby completed (FIG. 2).

That is, the present invention comprises the following aspects.

(1) A method for determining the base sequence of each specimen, in which: a plurality of IDs (1, 2, 3, 4) are assigned to each of a plurality of nucleic acid specimens (plasmids A, B, C, D) such that no two specimens receive the same combination (A = 1 & 2, B = 1 & 3, C = 2 & 4, D = 3 & 4); subgroups are formed by pooling the specimens that share an ID (division and mixing: 1 = A + B, 2 = A + C, 3 = B + D, 4 = C + D); each subgroup is fragmented; a subgroup-specific Index sequence is attached to the nucleic acids of each subgroup (Index ligation: 1 = AA, 2 = CT, 3 = CG, 4 = AT); all subgroups are mixed and sequenced together (mixed sequencing); the obtained sequences are classified according to their Index sequences and thereby associated with each specimen (grouping by Index: A = AA + CT, B = AA + CG, C = AT + CT, D = AT + CG); and the associated sequences are assembled. (The items in parentheses show the case of FIG. 2, which illustrates the simultaneous analysis of four samples.)

(2) The method for determining a base sequence according to (1) above, wherein the specimens are a^n specimens consisting of n0 specimens and (a^n − n0) mock specimens.

(3) The method for determining a base sequence according to (2) above, wherein the subgroups are formed by an operation equivalent to the following: for the X-th specimen, (X − 1) is written in base a; each combination of digit position and digit value is regarded as an ID; and, for each digit position, the specimens having the same value are combined into one subgroup, this being repeated for all digit positions.

(4) The method for determining a base sequence according to (3) above, wherein a is 2 or a multiple of 2.

(5) The method for determining a base sequence according to (4) above, wherein the obtained sequences are classified according to their Index sequences and associated with each specimen by matching the combination of Index sequences attached to identical sequences against the combinations of IDs assigned to the specimens.

(6) The method for determining a base sequence according to (1) above, wherein the nucleic acid specimen is DNA or RNA.

(7) The method for determining a base sequence according to (1) above, wherein the sequence analysis technique uses a next-generation sequencer.

(8) The method for determining a base sequence according to (1) above, wherein the Index sequence is DNA, RNA, or a mixture thereof.

(9) The method for determining a base sequence according to (8) above, wherein the Index sequence includes a parity element.

(10) A method for determining a base sequence in which the methods of (6) to (9) above are combined in any manner with those of (1) to (5) above.

(11) A computer program for executing the method according to (10) above on a computer.
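The four-sample case given in parentheses in aspect (1) can be traced in a few lines of Python. This is a minimal sketch, not the inventor's program: the ID combinations and Index sequences are taken from the FIG. 2 example, while the function names `build_subgroups` and `decode` are hypothetical.

```python
# ID combinations from the FIG. 2 example: A = 1 & 2, B = 1 & 3, C = 2 & 4, D = 3 & 4
SAMPLE_IDS = {"A": {1, 2}, "B": {1, 3}, "C": {2, 4}, "D": {3, 4}}
# Subgroup-specific Index sequences from the same example
INDEX_OF_ID = {1: "AA", 2: "CT", 3: "CG", 4: "AT"}


def build_subgroups(sample_ids):
    """Pool the samples that share an ID: returns {id: set of sample names}."""
    groups = {}
    for sample, ids in sample_ids.items():
        for group_id in ids:
            groups.setdefault(group_id, set()).add(sample)
    return groups


def decode(observed_indexes, sample_ids=SAMPLE_IDS, index_of_id=INDEX_OF_ID):
    """Return the sample whose ID combination matches the observed Indexes.

    Returns None when the combination was not assigned to any sample.
    """
    id_of_index = {index: group_id for group_id, index in index_of_id.items()}
    ids = frozenset(id_of_index[index] for index in observed_indexes)
    table = {frozenset(combo): sample for sample, combo in sample_ids.items()}
    return table.get(ids)


subgroups = build_subgroups(SAMPLE_IDS)
print(subgroups)               # subgroup 1 pools A and B, and so on
print(decode({"AA", "CT"}))    # a sequence seen with Indexes AA and CT
```

Because every sample carries a distinct pair of IDs, the pair of Indexes observed on a sequence is enough to name the sample it came from.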

According to the present invention, a single next-generation sequencer run can determine the sequences of a number of specimens that grows exponentially with the number of Indexes used (FIGS. 1 and 2).

For example, using 20 Index sequences, the sequences of 2^10, that is, 1,024 specimens can be determined in a single next-generation sequencer run, and using 40 Index sequences, the sequences of 2^20, that is, 1,048,576 specimens, while the correspondence between specimens and output sequences is retained.
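The arithmetic behind these figures can be checked directly. The sketch below assumes the base-a digit scheme of aspect (3), in which n digit positions with a possible values each require a × n Indexes and distinguish a^n specimens; the function name `sample_capacity` is hypothetical.

```python
def sample_capacity(num_indexes, a=2):
    """Number of specimens distinguishable with num_indexes Indexes in base a.

    Assumes one Index per (digit position, digit value) pair, so
    num_indexes = a * n and the capacity is a ** n.
    """
    if a < 2 or num_indexes % a != 0:
        raise ValueError("num_indexes must be a positive multiple of a >= 2")
    n = num_indexes // a
    return a ** n


print(sample_capacity(20))  # 2 ** 10
print(sample_capacity(40))  # 2 ** 20
```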

Specifically, base sequence analysis of plasmid DNA, which has so far been carried out by the capillary method, can be replaced by the combination of the present invention and a next-generation sequencer, and a substantial improvement in cost-effectiveness is expected as a result.

FIG. 1: Effect of the invention
FIG. 2: Summary of the invention
FIG. 3: Overview of the specimen mixing method
FIG. 4: What parity is
FIG. 5: Summary of the invention (2)
FIG. 6: Accession numbers of the base sequences used in Example 1
FIG. 7: Composition of the mixed groups used in Example 1
FIG. 8: Example 1, restoration rate
FIG. 9: Example 1, restoration rate

Embodiments of the present invention are described in detail below.

In the present invention, a next-generation sequencer is any of a group of recently developed base sequence analyzers whose analytical capacity has been dramatically improved by parallel processing. Specific examples include the Illumina HiSeq 2000, the Roche Genome Sequencer, the Life Technologies Ion PGM, and instruments from Pacific Biosciences (Non-Patent Document 1). Instruments developed in the future are also included, and the term is not limited to those listed here.

In the present invention, a nucleic acid specimen is a collection of DNA or RNA molecules having the same sequence. Specifically, a plasmid amplified and cloned in an E. coli system falls under this definition. The nucleic acid specimen may also be a PCR product. "The same sequence" here covers not only completely identical sequences but also collections of molecules whose sequences are only partially identical. A certain amount of contamination may also be present.

The method for mixing nucleic acid specimens is illustrated in FIG. 3 and below, but is not limited to what is shown here.

Let n0 be the number of specimens and let a be an integer constant. Find the smallest n for which a^n > n0, and add (a^n − n0) mock specimens to prepare a^n specimens numbered 0 to (a^n − 1). For the X-th specimen, write (X − 1) in base a, regard each combination of digit position and digit value as an ID, and, for each digit position, mix the specimens having the same value into one subgroup.
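The padding and grouping step can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure itself: specimens are numbered from 0 (so the number already plays the role of X − 1), digit position 0 is taken to be the least significant digit, and the smallest n is chosen with a^n ≥ n0 so that a full set such as 256 = 2^8 needs no mock specimens, as in Example 1. The name `plan_subgroups` is hypothetical.

```python
def plan_subgroups(n0, a=2):
    """Pad n0 specimens to a**n and group them by base-a digit.

    Returns (n, total, subgroups), where subgroups maps
    (digit_position, digit_value) to the list of specimen numbers pooled
    together. Specimen numbers n0 .. total-1 are mock specimens.
    """
    if a < 2 or n0 < 1:
        raise ValueError("need a >= 2 and n0 >= 1")
    n = 1
    while a ** n < n0:  # assumption: a**n >= n0 is sufficient
        n += 1
    total = a ** n
    subgroups = {}
    for specimen in range(total):
        value = specimen
        for digit_position in range(n):  # 0 = least significant digit
            value, digit = divmod(value, a)
            subgroups.setdefault((digit_position, digit), []).append(specimen)
    return n, total, subgroups


n, total, subgroups = plan_subgroups(7, a=3)
print(n, total, len(subgroups))  # digits, padded specimen count, number of subgroups
print(subgroups[(0, 0)])         # specimens whose lowest base-3 digit is 0
```

Each specimen ends up in exactly n subgroups, one per digit position, and the number of subgroups is a × n.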

A specific example for a = 2 is given below together with Table 1, but the mixing method is not limited to the method illustrated.

When 2^n nucleic acid samples, numbered #0 to #(2^n − 1), are analyzed in parallel, the specimens are mixed according to the following rule.

Write the specimen number in binary. If the k-th digit is 0, assign mixed group ID (2 × (n − k) + 1); if it is 1, assign mixed group ID (2 × (n − k) + 2). Repeat this for every k from 1 to n.

For example, when n = 3 the total number of specimens is 8, and specimen #6 is written 110 in binary. The third digit is 1, giving (2 × (3 − 3) + 2) = 2; the second digit is 1, giving (2 × (3 − 2) + 2) = 4; and the first digit is 0, giving (2 × (3 − 1) + 1) = 5. Specimen #6 is therefore assigned mixed group IDs 2, 4, and 5. Conversely, the specimens assigned ID 2 are #6 (110), #7 (111), #5 (101), and #4 (100), so these are pooled to make one mixed specimen.
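The rule and the n = 3 example can be reproduced with a short Python sketch. The digit numbering follows the worked example, where the first digit is the least significant binary digit; the function names `mixed_group_ids` and `members` are hypothetical.

```python
def mixed_group_ids(specimen, n):
    """Mixed group IDs (1-based) of a specimen under the binary rule.

    The k-th binary digit (k = 1 is the least significant) contributes
    ID 2*(n-k)+1 when it is 0 and ID 2*(n-k)+2 when it is 1.
    """
    ids = []
    for k in range(n, 0, -1):
        digit = (specimen >> (k - 1)) & 1
        ids.append(2 * (n - k) + 1 + digit)
    return ids


def members(group_id, n):
    """Specimens pooled into the mixed specimen with the given ID."""
    return [s for s in range(2 ** n) if group_id in mixed_group_ids(s, n)]


print(mixed_group_ids(6, 3))  # IDs assigned to specimen #6 (binary 110)
print(members(2, 3))          # specimens pooled under mixed group ID 2
```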

The 2^n nucleic acid specimens, organized into 2n mixed specimens by the above rule, are subjected to the known multiplex method using a next-generation sequencer (Non-Patent Document 2) to obtain base sequences.

A parity base can be added to the Indexes used here (FIG. 4).

The base sequences are reconstructed as follows (FIGS. 2 and 5).

Each sequence output by the next-generation sequencer carries an Index, so the sequence can be associated with a mixed ID based on that Index.

When 2^n nucleic acid specimens are organized into 2n mixed specimens and then analyzed in parallel, and the mixed IDs are numbered from 0, specimen #0 is contained in mixed IDs 0, 2, 4, ..., 2n−2. Similarly, specimen #1 is contained in mixed IDs 0, 2, 4, ..., 2n−4, 2n−1.

That is, sequences derived from specimen #0 carry the Indexes corresponding to mixed IDs 0, 2, 4, ..., 2n−2. Similarly, sequences derived from specimen #1 carry the Indexes corresponding to mixed IDs 0, 2, 4, ..., 2n−4, 2n−1.

Conversely, the original specimen number can be identified by tallying the Indexes attached to identical sequences and analyzing their distribution.

For example, if a base sequence output by the next-generation sequencer carries the Indexes corresponding to mixed IDs 0, 2, 4, ..., 2n−2, that sequence is understood to be part of specimen #0. Similarly, if a sequence carries the Indexes corresponding to mixed IDs 0, 2, 4, ..., 2n−4, 2n−1, it is understood to be part of specimen #1.
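The inverse lookup can be sketched in Python. The sketch assumes the zero-based mixed IDs used in this passage, that is, ID = 2 × (n − k) + (value of the k-th binary digit), which is the earlier rule shifted down by one. The function name `specimen_from_mixed_ids` is hypothetical, and returning None for incomplete or contradictory patterns is a choice made here, not something the source specifies.

```python
def specimen_from_mixed_ids(ids, n):
    """Recover the specimen number from a complete set of zero-based mixed IDs.

    Each ID encodes one binary digit: ID // 2 is (n - k) and ID % 2 is the
    value of the k-th digit (k = 1 is the least significant digit).
    Returns None unless exactly one ID is present for every digit position.
    """
    ids = set(ids)
    if len(ids) != n:
        return None
    specimen = 0
    seen_positions = set()
    for mixed_id in ids:
        position, digit = divmod(mixed_id, 2)
        if not 0 <= position < n or position in seen_positions:
            return None  # out of range, or two IDs for the same digit
        seen_positions.add(position)
        k = n - position
        specimen |= digit << (k - 1)
    return specimen


print(specimen_from_mixed_ids({0, 2, 4, 6}, 4))  # all digits 0
print(specimen_from_mixed_ids({0, 2, 4, 7}, 4))  # only the lowest digit is 1
```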

The base sequence of each sample can be obtained by building contigs with an assembler from the sequencer output that has been associated with the original specimen numbers. ABySS (Non-Patent Document 3), Velvet, or the like may be used as the assembler, but the assembler is not limited to these.

The sequences output by the next-generation sequencer may be shortened as appropriate before being used in the above analysis.

Embodiments of the present invention are described below by way of an example, which is merely illustrative and does not limit the present invention in any way.

(Example 1) Simulated analysis by computer
A set of 1,066 known full-length cDNA sequences of Cryptosporidium parvum was used as the target of the simulated analysis. These sequences were determined by comparing sequence fragments obtained by the Sanger method, together with 18,308,250 sequence fragments obtained by shotgun analysis on a next-generation sequencer (accession numbers SRX004536 to SRX004538), against the known Cryptosporidium parvum genome sequence (Non-Patent Document 4).

In this example, with a set to 2 and n set to 8 as described in claim 2, 256 sequences were selected from the 1,066 sequences (FIG. 6). From the 18,308,250 fragment sequences, the 5,474,164 fragment sequences corresponding to those 256 sequences were selected and used in the subsequent steps.

In this example, a simulated analysis by computer was used to attempt to reconstruct the 256 sequences from the 5,474,164 fragment sequences by processing based on the present invention.

The 256 target sequences were mixed in the combinations shown in FIG. 7 to form 16 groups. Each sequence therefore belongs to eight groups.

Using the mapping tool Bowtie (Non-Patent Document 5), the 5,474,164 fragment sequences were associated with the 256 target sequences, and each fragment was then randomly assigned to one of the eight groups described above. This operation is specific to the simulated analysis; in a real analysis it is replaced by processing based on the Index sequences. In addition, the four bases at the 5' end were removed as the Index-equivalent sequence, and the remaining 32 bases were used in the subsequent analysis.

From each 32-base sequence under analysis, every run of 22 consecutive bases was extracted, giving a subset of 11 sequences per original sequence.
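The trimming and windowing described in the last two paragraphs amount to a sliding window, sketched below. A 36-base read loses its first 4 bases, and the remaining 32 bases yield 32 − 22 + 1 = 11 windows. The function name `trim_and_window` and the example read are hypothetical.

```python
def trim_and_window(read, index_len=4, k=22):
    """Drop the Index-equivalent prefix, then list every k-base window."""
    body = read[index_len:]
    return [body[i:i + k] for i in range(len(body) - k + 1)]


read = "ACGT" * 9                 # a made-up 36-base read
windows = trim_and_window(read)
print(len(windows))               # number of 22-base windows from 32 bases
```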

Identical 22-base sequences were tallied, and the pattern of Index sequences attached to them was analyzed to infer which of the 256 sequences each 22-base sequence derived from.
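The tallying step can be sketched as a dictionary of group sets that is then matched against the known ID combinations. This is an illustrative simplification: the source does not state how partially observed or ambiguous patterns were handled, so the sketch simply leaves them unassigned. The function name `assign_kmers` is hypothetical, and the lookup table reuses the four-sample combinations of FIG. 2 with short made-up sequences in place of 22-base ones.

```python
from collections import defaultdict


def assign_kmers(observations, pattern_to_sample):
    """Assign each k-mer to a sample from the groups it was observed in.

    observations: iterable of (kmer, group_id) pairs, one per read window.
    pattern_to_sample: {frozenset of group IDs: sample name}.
    Returns {kmer: sample name, or None when the pattern matches no sample}.
    """
    groups_seen = defaultdict(set)
    for kmer, group_id in observations:
        groups_seen[kmer].add(group_id)
    return {
        kmer: pattern_to_sample.get(frozenset(groups))
        for kmer, groups in groups_seen.items()
    }


# Combinations from the four-sample example: A = 1 & 2, ..., D = 3 & 4
PATTERNS = {
    frozenset({1, 2}): "A",
    frozenset({1, 3}): "B",
    frozenset({2, 4}): "C",
    frozenset({3, 4}): "D",
}
observed = [("ACGT", 1), ("ACGT", 2), ("ACGT", 1),
            ("TTTT", 3), ("TTTT", 4), ("GGGG", 1)]
assignment = assign_kmers(observed, PATTERNS)
print(assignment)
```

A k-mer seen in only some of its groups, or one shared by several samples, produces a pattern that is not in the table and stays unassigned in this sketch.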

For each of the 256 sequences, the 22-base sequences associated with it were collected, and contigs were built with the assembler ABySS (Non-Patent Document 3).

The degree of restoration was estimated by comparing the resulting contig sequences with the original sequences.

Here, the degree of restoration is defined as (length of the longest contig / length of the original sequence).
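The definition translates directly into code. The sketch below computes the degree of restoration for one sequence and counts how many sequences reach a given threshold; the function names and the example lengths are hypothetical and are not data from the example.

```python
def restoration_degree(contig_lengths, original_length):
    """Length of the longest contig divided by the original sequence length."""
    if original_length <= 0:
        raise ValueError("original_length must be positive")
    return max(contig_lengths, default=0) / original_length


def count_at_least(degrees, threshold):
    """Number of sequences whose degree of restoration reaches the threshold."""
    return sum(1 for degree in degrees if degree >= threshold)


degrees = [
    restoration_degree([100, 500], 1000),  # longest contig covers half
    restoration_degree([800], 800),        # fully restored
    restoration_degree([], 500),           # no contig was built
]
print(degrees)
print(count_at_least(degrees, 0.5))
```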

As a result, the method was shown to restore most of the sequences. Of the 256 sequences, 104 were restored over 99% or more of their length, 160 over 95% or more, and 199 over 90% or more. Details are shown in FIGS. 8 and 9.

Base sequence analysis of plasmid DNA, which has so far been carried out by the capillary method, can be replaced by the combination of the present invention and a next-generation sequencer, and a substantial improvement in cost-effectiveness is expected as a result.

Claims

A method for determining the base sequence of each specimen, in which: a plurality of IDs (1, 2, 3, 4) are assigned to each of a plurality of nucleic acid specimens (plasmids A, B, C, D) such that no two specimens receive the same combination (A = 1 & 2, B = 1 & 3, C = 2 & 4, D = 3 & 4); subgroups are formed by pooling the specimens that share an ID (division and mixing: 1 = A + B, 2 = A + C, 3 = B + D, 4 = C + D); each subgroup is fragmented; a subgroup-specific Index sequence is attached to the nucleic acids of each subgroup (Index ligation: 1 = AA, 2 = CT, 3 = CG, 4 = AT); all subgroups are mixed and sequenced together (mixed sequencing); the obtained sequences are classified according to their Index sequences and thereby associated with each specimen (grouping by Index: A = AA + CT, B = AA + CG, C = AT + CT, D = AT + CG); and the associated sequences are assembled. (The items in parentheses show the case of FIG. 2, which illustrates the simultaneous analysis of four samples.)

The method for determining a base sequence according to claim 1, wherein the specimens are a^n specimens consisting of n0 specimens and (a^n − n0) mock specimens.

The method for determining a base sequence according to claim 2, wherein the subgroups are formed by an operation equivalent to the following: for the X-th specimen, (X − 1) is written in base a; each combination of digit position and digit value is regarded as an ID; and, for each digit position, the specimens having the same value are combined into one subgroup, this being repeated for all digit positions.

The method for determining a base sequence, wherein a in claim 3 is 2 or a multiple of 2.

The method for determining a base sequence according to claim 1, wherein the obtained sequences are classified according to their Index sequences and associated with each specimen by matching the combination of Index sequences attached to identical sequences against the combinations of IDs assigned to the specimens.

The method for determining a base sequence, wherein the nucleic acid sample according to claim 1 is DNA or RNA.

A method for determining a base sequence, wherein the sequence analysis technique according to claim 1 uses a next-generation sequencer.

The method for determining a base sequence, wherein the Index sequence according to claim 1 is DNA, RNA or a mixed substance thereof.

The method for determining a base sequence, wherein the Index sequence according to claim 8 includes a parity factor.

A method for determining a base sequence, wherein the methods according to claims 6 to 9 are arbitrarily combined with claims 1 to 5.

A computer program for executing the method according to claim 10 by a computer.